Contractor Senior Specialist Cloud SRE Kubernetes

Mumbai - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Description

Job Title: Senior Site Reliability Engineer (SRE) Kubernetes Automation & Observability
Experience: 8-10 years
Job Location: Mumbai
Role Level: Senior Individual Contributor

Job Description:
We are seeking a Senior Site Reliability Engineer (SRE) to support the customer's AWS/Azure platform modernization and reliability initiatives. This role focuses on migrating legacy worker processes to Kubernetes, strengthening Infrastructure as Code (IaC) and CI/CD pipelines, and driving strong observability and operational excellence.
The SRE will work closely with the customer's engineering teams to embed reliability, automation, and monitoring into the platform while ensuring high availability, scalability, and predictable deployments.

Key Responsibilities:
Kubernetes & Platform Modernization:
Lead the containerization and migration of existing worker processes to Kubernetes.
Design Kubernetes-native deployment patterns, including health checks, autoscaling, and failure recovery.
Define resource requests/limits, rollout strategies, and operational standards for workloads.

Reliability Engineering & SRE Practices:
Define, implement, and maintain Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for critical services.
Continuously monitor SLO compliance and drive improvements based on error budget usage.
Participate in architecture reviews focused on high availability, scalability, and fault tolerance.
Apply resilience patterns such as retries, circuit breakers, rate limiting, and graceful degradation.

Incident, Problem & Change Management:
Act as a Tier-3 escalation point for production and deployment issues.
Lead incident response, blameless postmortems, and Root Cause Analysis (RCA).
Maintain and improve runbooks, escalation paths, and on-call readiness.
Track and improve key metrics such as mean time to recovery (MTTR), deployment success rate, and incident frequency.

Automation & Infrastructure as Code:
Develop and maintain Infrastructure as Code using Terraform, CloudFormation, and AWS CDK.
Build and enhance CI/CD pipelines supporting rolling, blue/green, and canary deployments.
Automate Dev-to-Staging redeployments with validation, rollback, and promotion mechanisms.
Reduce operational toil through automation and self-healing workflows.

Monitoring, Observability & Logging (SRE Tools Focus):
Design and operate end-to-end observability covering metrics, logs, and traces.
Hands-on experience with:
- New Relic / Datadog for APM, distributed tracing, and SLO tracking
- Prometheus for metrics collection
- Grafana for dashboards and SRE scorecards
- Graylog / ELK for centralized logging and root cause analysis
Ensure alerts are SLO-driven, actionable, and low-noise.
Build customer-facing dashboards to demonstrate reliability and deployment health.

Cloud Infrastructure & Platform Reliability:
Provision and operate cloud infrastructure primarily on AWS.
Manage compute, networking, load balancers, IAM, backups, patching, and disaster recovery (DR) readiness.
Optimize performance and cost through autoscaling, rightsizing, and capacity planning.
Support reliability of data platforms such as MongoDB, Elasticsearch/OpenSearch, MySQL (RDS), and DocumentDB.

Required Qualifications:
8-10 years of experience in SRE, DevOps, or Cloud Engineering roles.
Strong Linux fundamentals with scripting (Bash, Python, or equivalent).
Hands-on experience with:
- Kubernetes & containerized workloads
- Terraform / CloudFormation / AWS CDK
- CI/CD pipelines and deployment automation
- Observability tools: New Relic, Datadog, Prometheus, Grafana, ELK/Graylog
Strong understanding of distributed systems and production operations.
AWS experience (certification preferred).

Nice-to-Have Skills:
Kubernetes certifications (CKA / CKAD).
Experience with canary, blue/green, or progressive delivery models.
Exposure to chaos engineering or fault injection.
Experience supporting customer-facing production platforms.

Required Experience:

Senior IC

Key Skills

Computer Hardware
macOS
Manufacturing & Controls
Root Cause Analysis
Windows
Customer Support
Remote Access Software
Operating Systems
Encryption
Remedy
Chemistry
Cerner

About Company

Datavail Infotech

Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leadi ...
