Key Responsibilities:
1. Strategic Leadership & Consulting
Define and implement SRE strategies aligned with business and technology
objectives.
Act as a trusted advisor to executive leadership, influencing reliability,
observability, and automation initiatives.
Collaborate with engineering, cloud, DevOps, security, and platform teams
to drive reliability and resilience roadmaps.
Conduct reliability assessments, risk analysis, and gap identification for
continuous service improvement.
Lead the adoption of SRE culture across the organization, evangelizing
reliability engineering principles.
2. Site Reliability & Observability Architecture
Architect and implement scalable observability solutions, including APM,
logs, traces, and metrics (e.g., Prometheus, Grafana, Datadog, New Relic,
Splunk).
Develop a unified monitoring and alerting framework that integrates real-time
insights and automated response mechanisms.
Establish and refine SLOs, SLIs, and error budgets to enhance service
reliability.
Optimize incident management and root cause analysis using AI-driven
observability and predictive analytics.
3. Incident & Service Management
Define and implement best practices for incident response, post-mortems,
and problem resolution.
Improve MTTR (Mean Time to Repair) and MTTF (Mean Time to Failure)
through proactive automation and analytics-driven insights.
Senior Engineering Leader (SRE) 2
Develop robust escalation and alerting policies that reduce alert noise and
improve signal-to-noise ratios in monitoring.
Drive RPA (Robotic Process Automation) and workflow automation to
eliminate repetitive manual operational tasks.
4. Toil Elimination & Automation
Identify and eliminate operational toil using self-healing infrastructure,
runbook automation, and auto-remediation workflows.
Champion AI/ML-based predictive analytics for anomaly detection, capacity
planning, and proactive incident prevention.
Develop CI/CD-driven operational automation to reduce manual
interventions in deployments and rollbacks.
Build and lead initiatives in AIOps, ChatOps, and ITSM automation to
streamline support operations.
5. Talent Development & Technical Leadership
Mentor, coach, and grow high-performing SRE teams, fostering a culture of
innovation and continuous learning.
Drive SRE training programs, workshops, and certifications to upskill
engineers on modern reliability practices.
Establish and promote career development frameworks for SREs at
different levels.
Cultivate an environment of psychological safety, collaboration, and shared
responsibility for reliability.
6. Governance, Compliance & Cost Optimization
Ensure governance and compliance with regulatory requirements (e.g., ISO
27001, SOC 2, NIST, ITIL).
Optimize cloud cost efficiency through effective capacity planning,
autoscaling, and FinOps principles.
Define policies for resilience engineering, chaos engineering experiments,
and disaster recovery planning.
Work closely with InfoSec teams to implement security monitoring and threat
detection capabilities.
Required Qualifications & Experience:
Technical Expertise
15 years of experience in Site Reliability Engineering, DevOps, or
Infrastructure Engineering.
Expertise in monitoring & observability tools like Datadog, Prometheus,
Grafana, New Relic, AppDynamics, Splunk, and OpenTelemetry.
Hands-on experience with incident management, service management (ITIL),
and automation platforms.
Strong background in toil elimination and workflow automation using RPA,
AIOps, and event-driven automation.
Proficiency in programming (Python, Go, Java, or similar) for scripting and
automation.
Experience with Kubernetes, Service Mesh (Istio, Linkerd), and cloud-native
architectures.
Deep understanding of SLOs, SLIs, error budgets, and reliability engineering
principles.
Expertise in Cloud Platforms (AWS, Azure, GCP), including serverless,
containerization, and networking.
Strong understanding of predictive analytics and AI/ML for anomaly detection
and self-healing systems.
Leadership & Consulting Skills
Proven experience leading large-scale SRE teams and driving enterprise-wide
reliability initiatives.
Ability to influence executive stakeholders and drive strategic decision-
making.
Experience mentoring, coaching, and developing engineering talent.
Exceptional problem-solving and incident management skills with a data-
driven approach.
Strong communication, documentation, and storytelling abilities to convey
reliability insights.
Preferred Qualifications:
Certification in SRE (Google SRE, CRE), AWS/Azure/GCP Architect, ITIL, or
TOGAF.
Experience with AI-driven IT operations (AIOps) and generative AI-based
observability.
Hands-on expertise in workflow orchestration tools (Airflow, Argo Workflows,
Camunda, ServiceNow).
Familiarity with SRE in highly regulated industries such as finance, healthcare,
or telecom.
Strong background in distributed systems, microservices, and API reliability
engineering.
predictive analytics,site reliability engineering,service management (itil),governance & compliance,kubernetes,automation,workflow automation,cost optimization,cloud,cloud platforms (aws, azure, gcp),programming (python, go, java),rpa,aiops,devops,infrastructure engineering,service mesh (istio, linkerd),ai/ml,reliability,incident management,toil elimination,automation platforms,monitoring & observability tools