Job Title: Senior Associate Cloud SRE
Education: Any Graduate
Experience: 4 to 8 years
Location: Mumbai (Hybrid Model)
Employment Type: Full-time
Overview:
We are seeking a Site Reliability Engineer to deliver tier two cloud operations and managed services support for AWS environments. This role combines advanced troubleshooting and operational excellence with proactive reliability engineering, focusing on maintaining 24x7x365 service availability while continuously improving automation and operational efficiency.
Role Summary:
As a Site Reliability Engineer supporting AWS infrastructure, you will manage complex operational challenges and escalations while implementing reliability best practices across production systems. You will work collaboratively with customer teams and senior engineers to ensure system stability, automate operational workflows, and maintain comprehensive observability. This is a delivery-focused role requiring both advanced technical execution and operational ownership.
Primary Responsibilities:
Tier 2 Cloud Operations & Managed Services
Provide 24x7x365 tier two support and escalation handling for AWS environments
Execute complex operational tasks including:
Patching and managing Amazon Machine Images (AMIs)
Creating and configuring EC2 instances and RDS databases
Managing IAM roles, users, and policies
Configuring S3 bucket policies and Access Control Lists (ACLs)
Opening and managing network routes
Restoring snapshots and database backups to lower environments
Increasing disk sizes and managing storage optimization
Implementing proper tagging for environment identification and cost allocation
Managing log archiving and retention policies
Handle escalations from tier one support with deep technical analysis
Provide root cause analysis for complex incidents and recurring issues
Reliability & Incident Management:
Implement and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in collaboration with senior engineers and customer stakeholders
Lead tier two incident response, performing advanced troubleshooting and resolution
Conduct thorough post-incident analysis with actionable remediation plans
Reduce reactive work by improving runbooks, alert configurations, and standard operating procedures
Apply reliability engineering best practices with oversight and review
Mentor tier one engineers during incident response
Automation & Infrastructure as Code:
Build and maintain CI/CD pipelines for infrastructure and application deployments
Automate complex operational tasks including patching, backups, and environment provisioning
Develop infrastructure automation using Terraform or equivalent IaC tools
Create sophisticated scripts and tooling to eliminate manual toil and improve operational efficiency
Follow established patterns and contribute continuous improvements
Document automation processes for knowledge sharing
Containerization & Deployment:
Deploy and operate containerized workloads using Docker on AWS services (ECS, EKS, or other managed container platforms)
Support container reliability through proper health checks, autoscaling configurations, and resource management
Implement safe deployment patterns (canary deployments, blue/green deployments)
Troubleshoot complex containerization and orchestration issues
Follow and enhance established containerization standards
Observability & Performance:
Configure and maintain comprehensive monitoring, logging, and alerting systems
Leverage observability data to identify issues and lead root cause analysis
Contribute to performance tuning and cost optimization initiatives
Ensure proper instrumentation and telemetry across AWS environments
Identify patterns and trends to prevent future incidents
Build custom dashboards and reports for operational insights
Collaboration & Customer Engagement:
Work closely with customer development and operations teams to improve system operability
Participate in design reviews and reliability assessments
Communicate technical concepts, tradeoffs, and recommendations clearly to stakeholders
Provide regular operational updates and service reports
Act as technical liaison between customers and internal engineering teams
Required Qualifications:
Experience:
3 to 5 years of hands-on experience in DevOps, SRE, or production operations roles
Proven experience operating production systems in AWS environments
Demonstrated experience managing containerized applications in production
Experience delivering managed services or supporting customer-facing infrastructure
Track record of handling complex technical escalations
Technical Skills:
AWS Services: Strong working knowledge of EC2, RDS, S3, IAM, VPC, CloudWatch, and related services
Containerization: Hands-on experience with Docker and container orchestration platforms (ECS, EKS, or managed Kubernetes)
Infrastructure as Code: Proficiency with Terraform or equivalent tools
CI/CD: Experience building and maintaining automated deployment pipelines
Scripting/Programming: Proficiency in Python, Go, Bash, or similar languages
Monitoring & Logging: Experience with observability tools (CloudWatch, Datadog, Splunk, ELK, or similar)
Version Control: Proficiency with Git and collaborative development workflows
Troubleshooting: Advanced diagnostic and problem-solving capabilities
Operational Capabilities:
Experience with 24x7 operations and tier two escalation support
Strong troubleshooting and root cause analysis skills
Understanding of networking concepts, security best practices, and compliance requirements
Familiarity with backup/restore procedures and disaster recovery planning
Ability to work under pressure during critical incidents
Preferred Qualifications:
AWS certifications (Solutions Architect Associate, SysOps Administrator, or DevOps Engineer Professional)
Experience with Kubernetes in production environments
Prior consulting or managed services provider experience
Multi-cloud experience (Azure, AWS)
Experience with configuration management tools (Ansible, Chef, Puppet)
Knowledge of security and compliance frameworks (HIPAA, SOC 2, PCI-DSS)
Cloud-agnostic certifications (Terraform Associate, CKA, or SRE Foundation)
Experience in healthcare, finance, or other regulated industries
Required Experience:
Senior IC
Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leadi ...