Job Summary
Req ID: 67171-1
Job Title: Engineer
Work Location: Chennai, Pune, Bangalore
Skill Required: Digital : Site Reliability Engineering (SRE)
Experience Range: 6-8 years
Roles & Responsibilities
System Reliability & Availability
Design, implement, and maintain highly available and fault-tolerant systems
Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs)
Proactively identify reliability risks and drive remediation
Participate in on-call rotations and incident response
Incident Management
Lead and participate in production incident handling
Perform root cause analysis (RCA) and author post-incident reviews
Drive corrective and preventive actions to avoid recurrence
Improve mean time to detect (MTTD) and mean time to recover (MTTR)
Automation & Engineering
Automate operational tasks to eliminate manual toil
Build and maintain tools for deployment, monitoring, alerting, and scaling
Develop software to improve system reliability (not just scripts)
Monitoring & Observability
Implement monitoring, logging, and tracing solutions
Define actionable alerts based on SLOs (not symptom-only alerts)
Continuously improve system observability
Infrastructure & Cloud Operations
Manage infrastructure using Infrastructure as Code (IaC) tools
Work with containers and orchestration platforms (e.g., Kubernetes)
Optimize cloud resource usage for reliability and cost
Collaboration
Partner with development teams to embed reliability into design
Review architecture changes for reliability, scalability, and resilience
Promote best practices and SRE culture across teams