Senior Manager – Site Reliability Engineering (SRE)
Woonsocket, RI - USA
Role Summary
We are seeking a Senior Manager of Site Reliability Engineering (SRE) to help drive the activation, structure, and scaling of SRE practices across the Financial Services & Innovation (FS&I) organization.
This role is responsible for establishing operational discipline, driving adoption of SRE standards, and aligning application teams, Production Support Engineering (PSE), and platform teams to a consistent reliability model.
The ideal candidate brings a combination of technical depth, organizational leadership, and execution rigor, with proven experience implementing SRE practices in complex enterprise environments.
Key Responsibilities
SRE Activation & Operating Model
- Drive adoption of the SRE operating model across application teams
- Establish clarity in roles between:
  - SRE
  - Production Support Engineering (PSE)
  - Application teams
- Ensure SRE practices are embedded into the development lifecycle, not treated as post-production activities
Reliability Standards & Governance
- Define and enforce:
  - SLIs, SLOs, and error budgets
  - Production readiness criteria
  - Reliability best practices
- Lead SLO adoption and compliance reviews across the organization
- Establish governance frameworks to ensure consistent application of standards
Cross-Team Coordination & Enablement
- Partner with:
  - Application product teams
  - Production Support Engineering (MG team)
  - Platform / Infrastructure / Observability teams
- Drive alignment and reduce friction between engineering and operations
- Ensure clear handoffs, escalation models, and operational ownership
Observability & Monitoring Strategy
- Lead adoption of centralized observability standards across:
  - Metrics
  - Logging
  - Tracing
- Align tooling (AppDynamics, Splunk, Prometheus, etc.)
- Ensure monitoring and alerting are SLO-driven and actionable, not noise-based
Incident Management & Continuous Improvement
- Partner with PSE to strengthen:
  - Incident management processes
  - RCA (Root Cause Analysis) standards
- Drive identification of patterns and systemic issues
- Ensure learnings translate into engineering improvements and automation
Automation & Reliability Engineering
- Identify opportunities to:
  - Reduce manual operational work
  - Improve system resilience
  - Enable self-healing capabilities
- Promote a culture of engineering over reaction
Reporting & Organizational Insight
- Define and track reliability metrics across FS&I
- Build reporting that provides visibility into:
  - System health
  - Incident trends
  - SLO performance
- Translate technical data into actionable business insights
Required Qualifications
- 10 years in engineering, operations, or SRE roles
- 5 years leading SRE, platform, or reliability-focused teams
- Proven experience implementing SRE practices at scale (SLIs, SLOs, error budgets)
- Strong background in cloud environments (AWS, Azure, GCP)
- Hands-on experience with observability tools (Splunk, AppDynamics, Prometheus, etc.)
- Experience in incident management and production operations at scale
- Ability to operate effectively in high-pressure and complex enterprise environments
Preferred Qualifications
- Experience driving organizational transformation (not just technical implementation)
- Strong understanding of CI/CD, DevOps, and automation practices
- Experience working in regulated or large enterprise environments
- Familiarity with AIOps or advanced automation strategies
Key Success Indicators
- Increased adoption of SLOs and reliability standards
- Reduction in high-severity incidents over time
- Improved MTTR and operational efficiency
- Increased adoption of standardized observability practices
- Reduction in reactive, ticket-driven work across teams
- Clear alignment between SRE, PSE, and application teams
Core Competencies
- Strategic thinking with strong execution focus
- Ability to drive alignment across multiple teams and stakeholders
- Strong communication and influence skills
- Bias toward structure, clarity, and accountability
- Ability to operate with urgency and discipline in complex environments