Senior Associate Cloud SRE

Datavail Infotech


Job Location:

Mumbai - India

Monthly Salary: Not Disclosed
Posted on: 22 hours ago
Vacancies: 1 Vacancy

Job Summary

Description

Job Title: Senior Associate Cloud SRE

Education: Any Graduate

Experience: 4 to 8 years

Location: Mumbai (Hybrid Model)

Employment Type: Full-time

Overview:

We are seeking a Site Reliability Engineer to deliver tier two cloud operations and managed services support for AWS environments. This role combines advanced troubleshooting and operational excellence with proactive reliability engineering, focusing on maintaining 24x7x365 service availability while continuously improving automation and operational efficiency.

Role Summary:

As a Site Reliability Engineer supporting AWS infrastructure, you will manage complex operational challenges and escalations while implementing reliability best practices across production systems. You will work collaboratively with customer teams and senior engineers to ensure system stability, automate operational workflows, and maintain comprehensive observability. This is a delivery-focused role requiring both advanced technical execution and operational ownership.

Primary Responsibilities:

Tier 2 Cloud Operations & Managed Services:

  • Provide 24x7x365 tier two support and escalation handling for AWS environments

  • Execute complex operational tasks including:

      • Patching and managing Amazon Machine Images (AMIs)

      • Creating and configuring EC2 instances and RDS databases

      • Managing IAM roles, users, and policies

      • Configuring S3 bucket policies and Access Control Lists (ACLs)

      • Opening and managing network routes

      • Restoring snapshots and database backups to lower environments

      • Increasing disk sizes and managing storage optimization

      • Implementing proper tagging for environment identification and cost allocation

      • Managing log archiving and retention policies

  • Handle escalations from tier one support with deep technical analysis

  • Provide root cause analysis for complex incidents and recurring issues

Reliability & Incident Management:

  • Implement and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in collaboration with senior engineers and customer stakeholders

  • Lead tier two incident response, performing advanced troubleshooting and resolution

  • Conduct thorough post-incident analysis with actionable remediation plans

  • Reduce reactive work by improving runbooks, alert configurations, and standard operating procedures

  • Apply reliability engineering best practices with oversight and review

  • Mentor tier one engineers during incident response

Automation & Infrastructure as Code:

  • Build and maintain CI/CD pipelines for infrastructure and application deployments

  • Automate complex operational tasks including patching, backups, and environment provisioning

  • Develop infrastructure automation using Terraform or equivalent IaC tools

  • Create sophisticated scripts and tooling to eliminate manual toil and improve operational efficiency

  • Follow established patterns and contribute continuous improvements

  • Document automation processes for knowledge sharing

Containerization & Deployment:

  • Deploy and operate containerized workloads using Docker on AWS services (ECS, EKS, or other managed container platforms)

  • Support container reliability through proper health checks, autoscaling configurations, and resource management

  • Implement safe deployment patterns (canary deployments, blue/green deployments)

  • Troubleshoot complex containerization and orchestration issues

  • Follow and enhance established containerization standards

Observability & Performance:

  • Configure and maintain comprehensive monitoring, logging, and alerting systems

  • Leverage observability data to identify issues and lead root cause analysis

  • Contribute to performance tuning and cost optimization initiatives

  • Ensure proper instrumentation and telemetry across AWS environments

  • Identify patterns and trends to prevent future incidents

  • Build custom dashboards and reports for operational insights

Collaboration & Customer Engagement:

  • Work closely with customer development and operations teams to improve system operability

  • Participate in design reviews and reliability assessments

  • Communicate technical concepts, tradeoffs, and recommendations clearly to stakeholders

  • Provide regular operational updates and service reports

  • Act as technical liaison between customers and internal engineering teams

Required Qualifications:

Experience:

  • 3 to 5 years of hands-on experience in DevOps, SRE, or production operations roles

  • Proven experience operating production systems in AWS environments

  • Demonstrated experience managing containerized applications in production

  • Experience delivering managed services or supporting customer-facing infrastructure

  • Track record of handling complex technical escalations

Technical Skills:

  • AWS Services: Strong working knowledge of EC2, RDS, S3, IAM, VPC, CloudWatch, and related services

  • Containerization: Hands-on experience with Docker and container orchestration platforms (ECS, EKS, or managed Kubernetes)

  • Infrastructure as Code: Proficiency with Terraform or equivalent tools

  • CI/CD: Experience building and maintaining automated deployment pipelines

  • Scripting/Programming: Proficiency in Python, Go, Bash, or similar languages

  • Monitoring & Logging: Experience with observability tools (CloudWatch, Datadog, Splunk, ELK, or similar)

  • Version Control: Proficiency with Git and collaborative development workflows

  • Troubleshooting: Advanced diagnostic and problem-solving capabilities

Operational Capabilities:

  • Experience with 24x7 operations and tier two escalation support

  • Strong troubleshooting and root cause analysis skills

  • Understanding of networking concepts, security best practices, and compliance requirements

  • Familiarity with backup/restore procedures and disaster recovery planning

  • Ability to work under pressure during critical incidents

Preferred Qualifications:

  • AWS certifications (Solutions Architect Associate, SysOps Administrator, or DevOps Engineer Professional)

  • Experience with Kubernetes in production environments

  • Prior consulting or managed services provider experience

  • Multi-cloud experience (Azure, AWS)

  • Experience with configuration management tools (Ansible, Chef, Puppet)

  • Knowledge of security and compliance frameworks (HIPAA, SOC 2, PCI-DSS)

  • Cloud-agnostic certifications (Terraform Associate, CKA, or SRE Foundation)

  • Experience in healthcare, finance, or other regulated industries




Required Experience:

Senior IC


Key Skills

  • Business Development
  • Organizational Management
  • GAAP
  • Accounting
  • Analysis Skills
  • Meeting Facilitation
  • Microsoft PowerPoint
  • Investment Banking
  • Microsoft Excel
  • Private equity
  • Financial Modeling
  • Writing Skills

About Company


Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leadi ...
