Job Description:
ITSM / Site Reliability Engineering (SRE) Manager with 12 years of experience to lead service operations, reliability engineering, and continuous improvement initiatives across complex IT environments. The role focuses on delivering highly reliable, scalable, and resilient systems while ensuring strong ITSM governance and service excellence.
Key Responsibilities
- Lead ITSM processes, including Incident, Problem, Change, Release, and Availability Management
- Drive SRE practices such as SLIs, SLOs, error budgets, automation, and toil reduction
- Own production stability, availability, and performance for critical applications and platforms
- Manage major incident response, root cause analysis, and post-incident reviews
- Collaborate with engineering, infrastructure, and security teams to embed reliability into system design
- Implement and mature monitoring, alerting, observability, and automation frameworks
- Govern service delivery metrics, SLA dashboards, and continuous service improvement (CSI) initiatives
- Lead, mentor, and develop ITSM and SRE teams
- Ensure compliance with ITIL standards, security, and regulatory requirements