Job Title: Senior Site Reliability Engineer (SRE) - Automation & Observability
Experience: 8-10 years
Education: Any Degree
Location: Mumbai
Role Level: Senior Individual Contributor (Customer-facing)
Job Description:
We are seeking a Senior Site Reliability Engineer (SRE) to own and continuously improve the reliability, availability, scalability, and performance of business-critical services across multi-cloud environments (AWS, Azure, GCP).
This role combines strong SRE fundamentals, automation engineering, and observability expertise with customer leadership. You will work closely with customer engineering teams to embed reliability into application design, drive automation, lead incident response, and demonstrate measurable SRE outcomes through dashboards and metrics.
Key Responsibilities:
Reliability Engineering & SRE Practices:
Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services.
Continuously monitor SLO compliance and drive improvements based on error budget consumption.
Participate in architecture reviews focused on high availability, disaster recovery, scalability, and fault tolerance.
Incident, Problem & Change Management:
Lead incident response, acting as the Tier-3 escalation point for SRE and operations teams.
Drive blameless postmortems and Root Cause Analysis (RCA), and ensure corrective and preventive actions are implemented.
Define and maintain incident response runbooks, escalation paths, and on-call processes.
Track and improve key reliability metrics, including MTTR, incident frequency, and change failure rate.
Automation & Infrastructure as Code:
Automate infrastructure provisioning and operational workflows using Terraform, CloudFormation, and AWS CDK.
Build and maintain CI/CD pipelines supporting canary deployments, blue/green strategies, and automated rollbacks.
Implement event-driven automation and auto-remediation using AWS Lambda, Step Functions, or Azure Functions.
Continuously identify and eliminate operational toil through automation and self-healing systems.
Monitoring, Observability & Logging:
Design, implement, and operate end-to-end observability platforms covering metrics, logs, and traces.
Hands-on experience with:
  o New Relic / Datadog for APM, distributed tracing, and SLO tracking
  o Prometheus for metrics collection
  o Grafana for dashboards and SRE scorecards
  o Graylog / ELK for centralized logging and RCA
Ensure alerts are SLO-driven, actionable, and noise-free.
Build customer-facing dashboards to clearly demonstrate SRE service outcomes.
Cloud Infrastructure & Platform Reliability:
Provision and manage cloud infrastructure across AWS, Azure, and/or GCP.
Operate compute, storage, networking, load balancers, VPNs, and private connectivity.
Manage patching, backups, encryption, IAM/RBAC, and disaster recovery readiness.
Optimize performance and cost through rightsizing, autoscaling, and capacity planning.
Ensure reliability of data platforms such as MongoDB / MongoDB Atlas, Elasticsearch / OpenSearch, MySQL (RDS), and DocumentDB.
Customer Engagement & Mentorship:
Act as the primary technical contact for assigned customer accounts.
Lead reliability and observability discussions with customers and internal stakeholders.
Mentor mid-level and junior SREs, conducting reliability-focused design and operational reviews.
Maintain high-quality documentation, runbooks, SOPs, and operational playbooks.
Required Qualifications:
8-10 years of experience in SRE, Cloud Engineering, or Production Operations roles.
Strong OS fundamentals: Linux and Windows, with scripting (Bash, PowerShell).
Strong programming skills in Python, Go, or equivalent.
Proven hands-on experience with:
  o Infrastructure as Code (Terraform, CloudFormation, CDK)
  o CI/CD pipelines and deployment automation
  o Observability tools (New Relic, Datadog, Prometheus, Grafana, Graylog, ELK)
  o Distributed systems at production scale
Cloud certifications (one or more):
  o AWS (Associate or Professional)
  o Azure (AZ-104 / Architect Expert)
  o GCP (Professional Cloud Architect)
Cloud-agnostic certification such as Terraform Associate, CKA, or SRE Foundation.
Nice-to-Have Skills:
Experience with multi-cloud or hybrid architectures.
Exposure to cross-region or cross-cloud data replication.
Hands-on experience with chaos engineering or fault injection.
Knowledge of ITIL, Agile, or SRE maturity models.
Experience with serverless architectures (AWS Lambda, Azure Functions).
Required Experience:
Senior IC
Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leadi ...