Description
Sr. SRE Manager
We're looking for a seasoned, hands-on Senior Site Reliability Engineering (SRE) Manager to lead a high-impact team responsible for infrastructure, deployment, and observability across all environments of our highly customized global SaaS platform. This role blends operational excellence, automation strategy, and cross-functional collaboration to ensure system reliability, performance, and visibility, while mentoring a team that thrives on ownership and continuous improvement.
Key Responsibilities
Deployment & Infrastructure Operations
- Own and manage cloud infrastructure (and code deployment) processes across all environments.
- Partner with DevOps to consume CI pipelines and ensure seamless, reliable CD execution.
- Oversee infrastructure provisioning and environment readiness using IaC and automation tools.
- Ensure system reliability and compliance through OS patching and server upgrades.
- Define and manage server and storage backup strategies to meet customer RPO/RTO targets.
Observability & Monitoring
- Lead configuration and optimization of monitoring tools (New Relic, Uptime Robot, PagerDuty).
- Drive creation of dashboards, alerts, and automated reports for system health and performance.
- Ensure visibility into system and application behavior across all customer environments.
Leadership & Strategy
- Build and mentor a high-performing SRE team with a focus on ownership, accountability, and continuous improvement.
- Collaborate with Engineering, DevOps, DBA, and support teams to align reliability goals with product and customer needs.
- Develop and enforce best practices for incident response, postmortems, and change management.
- Serve as an escalation point for complex technical issues and customer concerns related to cloud infrastructure and services.
Operational Excellence
- Monitor and report on key reliability metrics, including system uptime, application performance, alert volumes, and severity-1 incidents.
- Identify and eliminate toil through automation and process refinement.
- Champion a culture of resilience, transparency, and proactive problem-solving.
Required Skills & Experience
- 6 years in SRE, DevOps, or infrastructure engineering roles supporting SaaS environments; 2 years in a leadership capacity.
- Strong experience with cloud platforms (AWS and Azure), containers (Kubernetes, Docker), and IaC tools (Terraform, CloudFormation).
- Deep understanding of CI/CD pipelines and deployment orchestration.
- Hands-on experience with observability platforms and telemetry pipelines.
- Excellent communication and stakeholder management skills.
Nice to Have
- Experience supporting a single-tenant SaaS platform.
- Familiarity with ITIL or ticket-based deployment workflows.
- Background in performance tuning and capacity planning.
Required Experience:
Manager