Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer
Location: Pune - In Office
Experience: 4-8 Years
On-Call Rotation Required (24/7 Production Support)
About the Role
We are seeking a Senior Site Reliability Engineer (SRE) / DevOps Engineer to ensure the reliability, scalability, security, and performance of our production systems across multi-cloud environments (AWS, GCP, Azure). This role combines strong DevOps automation expertise with true SRE ownership, including on-call participation, incident management, root cause analysis, reliability engineering, and proactive system improvements. The ideal candidate balances incident response and firefighting with long-term engineering improvements that reduce toil, improve SLAs, and strengthen system resilience.
Key Responsibilities
1. Incident Response & On-Call Ownership
Participate in 24/7 on-call rotation for production systems
Rapidly diagnose, mitigate, and resolve high-severity incidents
Lead Root Cause Analysis (RCA) and post-mortem documentation
Implement corrective and preventive measures to avoid recurrence
Maintain SLAs/SLOs and reduce Mean Time to Recovery (MTTR)
2. Engineering & System Hardening
Design and implement reliability improvements to increase availability and reduce system fragility
Engineer solutions to eliminate repetitive operational work (toil reduction)
Improve redundancy, failover strategies, and disaster recovery planning
Track and improve SRE metrics (availability, latency, error rates, capacity)
3. Infrastructure & Cloud Engineering (Multi-Cloud)
Manage and optimize infrastructure across:
AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
Google Cloud Platform (GCP) (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC); GCP experience is a plus
Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)
Administer and optimize Kubernetes clusters
Manage Helm deployments and containerized workloads
Implement Infrastructure as Code (Terraform preferred)
4. Monitoring, Observability & Performance Optimization
Design symptom-based alerting (user-impact driven monitoring)
Implement observability using:
Prometheus
Grafana
Datadog
AWS CloudWatch
Azure Monitor
Analyze system bottlenecks and optimize performance
Improve logging and distributed tracing practices
5. Good to Have: AI & Cloud-Native Workloads (Value Add)
Support deployment of AI services on Azure (Azure AI Services, AI Foundry)
Assist in infrastructure for RAG (Retrieval-Augmented Generation) workloads
Ensure scalability and reliability of AI/ML systems in production
6. Security & Compliance
Apply cloud security best practices (IAM, network segmentation, secrets management)
Collaborate on vulnerability remediation
Support compliance requirements where applicable
Required Technical Skills
Core Engineering
Strong scripting/programming skills (Python, Bash; Go is a plus)
Deep understanding of Linux systems and networking fundamentals
Experience working in production environments with high uptime requirements
Cloud & Infrastructure
Hands-on experience with at least one major cloud platform (AWS/GCP/Azure)
Kubernetes and container orchestration experience
Infrastructure as Code (Terraform preferred)
Git-based workflows (GitHub / GitLab / Azure Repos)
Monitoring & Observability
Experience with Prometheus, Grafana, Datadog, or similar tools
Understanding of SLIs, SLOs, and SLAs
Preferred Qualifications
Experience managing AI/ML workloads in cloud environments
Familiarity with distributed systems architecture
Exposure to OpenSearch / ELK stack
Experience reducing operational toil through automation
Basic knowledge of C# (.NET environments) is a plus
What We're Looking For
Ownership mindset, not just task execution
Calm under pressure during incidents
Strong debugging and analytical thinking skills
Ability to balance immediate incident response with long-term engineering improvements
Collaborative approach with development teams