Senior Site Reliability Engineer

Pune - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Location: Pune - In Office

Experience: 4-8 Years

On-Call Rotation Required (24/7 Production Support)

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) / DevOps Engineer who will be responsible for ensuring the reliability, scalability, security, and performance of our production systems across multi-cloud environments (AWS, GCP, Azure). This role combines strong DevOps automation expertise with true SRE ownership, including on-call participation, incident management, root cause analysis, reliability engineering, and proactive system improvements. The ideal candidate balances incident response and firefighting with long-term engineering improvements that reduce toil, improve SLAs, and strengthen system resilience.

Key Responsibilities

1. Incident Response & On-Call Ownership

Participate in 24/7 on-call rotation for production systems

Rapidly diagnose, mitigate, and resolve high-severity incidents

Lead Root Cause Analysis (RCA) and post-mortem documentation

Implement corrective and preventive measures to avoid recurrence

Maintain SLAs/SLOs and reduce Mean Time to Recovery (MTTR)

2. Engineering & System Hardening

Design and implement reliability improvements to increase availability and reduce system fragility

Engineer solutions to eliminate repetitive operational work (toil reduction)

Improve redundancy, failover strategies, and disaster recovery planning

Track and improve SRE metrics (availability, latency, error rates, capacity)

3. Infrastructure & Cloud Engineering (Multi-Cloud)

Manage and optimize infrastructure across:

AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)

Google Cloud Platform (GCP) (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC); GCP experience is a plus

Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)

Administer and optimize Kubernetes clusters

Manage Helm deployments and containerized workloads

Implement Infrastructure as Code (Terraform preferred)

4. Monitoring, Observability & Performance Optimization

Design symptom-based alerting (user-impact driven monitoring)

Implement observability using:

Prometheus

Grafana

Datadog

AWS CloudWatch

Azure Monitor

Analyze system bottlenecks and optimize performance

Improve logging and distributed tracing practices

5. AI & Cloud-Native Workloads (Good to Have)

Support deployment of AI services on Azure (Azure AI Services, AI Foundry)

Assist in infrastructure for RAG (Retrieval-Augmented Generation) workloads

Ensure scalability and reliability of AI/ML systems in production

6. Security & Compliance

Apply cloud security best practices (IAM, network segmentation, secrets management)

Collaborate on vulnerability remediation

Support compliance requirements where applicable

Required Technical Skills

Core Engineering

Strong scripting/programming skills (Python, Bash; Go is a plus)

Deep understanding of Linux systems and networking fundamentals

Experience working in production environments with high uptime requirements

Cloud & Infrastructure

Hands-on experience with at least one major cloud platform (AWS/GCP/Azure)

Kubernetes and container orchestration experience

Infrastructure as Code (Terraform preferred)

Git-based workflows (GitHub / GitLab / Azure Repos)

Monitoring & Observability

Experience with Prometheus, Grafana, Datadog, or similar tools

Understanding of SLIs, SLOs, and SLAs

Preferred Qualifications

Experience managing AI/ML workloads in cloud environments

Familiarity with distributed systems architecture

Exposure to OpenSearch / ELK stack

Experience reducing operational toil through automation

Basic knowledge of C# (.NET environments) is a plus

What We're Looking For

Ownership mindset, not just task execution

Calm under pressure during incidents

Strong debugging and analytical thinking skills

Ability to balance immediate incident response with long-term engineering improvements

Collaborative approach with development teams
