Site Reliability Engineer

Pune - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Role Overview

We are seeking a Senior Site Reliability Engineer with strong experience in building and maintaining scalable, resilient systems. The ideal candidate will have hands-on expertise in cloud-native technologies, infrastructure as code, observability, and automation, with a focus on Google Cloud Platform (GCP).

Key Responsibilities

Ensure the stability and reliability of cloud-native applications deployed on GCP, containerized with Docker, and orchestrated via Kubernetes.

Define, implement, and monitor SLOs, SLAs, and SLIs to measure system performance and user experience.

Automate infrastructure provisioning using Terraform and manage Kubernetes configurations with Kustomize and Helm.

Develop and maintain monitoring and alerting systems using Datadog and GCP-native tools.

Conduct incident analysis and postmortems to drive continuous improvement.

Collaborate with development teams to integrate reliability practices into CI/CD pipelines using GitHub Actions.

Manage and troubleshoot database systems, particularly PostgreSQL and Cassandra.

Apply networking knowledge and Linux system administration skills to troubleshoot and optimize system connectivity and performance.

Qualifications

Education

Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent practical experience.

Work Experience & Skills

5 years of experience in Site Reliability Engineering.

Proven experience designing and operating elastic, resilient systems in cloud environments.

Strong understanding of GCP, Kubernetes, and container orchestration.

Proficiency in infrastructure as code and configuration management tools (Terraform, Helm, Kustomize).

Experience with monitoring and observability tools (Datadog, GCP Monitoring).

Solid scripting skills in Bash and familiarity with automation frameworks.

Experience with CI/CD pipelines, especially using GitHub Actions.

Familiarity with networking fundamentals and troubleshooting.

Strong coding skills and ability to develop reliability-focused tooling.

Excellent communication skills in English (written and spoken).

Other Requirements

Strong problem-solving skills and a process-oriented mindset.

Ability to work independently and collaboratively in a fast-paced environment.

Passion for clean code, automation, and continuous improvement.

Nice-to-Have

Familiarity with monitoring tools (e.g., Datadog, Prometheus, GCP Monitoring).

Experience working in Agile/Scrum teams.

Remote Work:

Employment Type: Full-time

Key Skills

Kubernetes
FMEA
Continuous Improvement
Elasticsearch
Go
Root cause Analysis
Maximo
CMMS
Maintenance
Mechanical Engineering
Manufacturing
Troubleshooting

About Company

METROMAKRO

METRO is a leading international wholesale company with food and non-food assortments that specialises in serving the needs of hotels, restaurants and caterers (HoReCa) as well as independent traders. Around the world, METRO has 15 million customers who can choose whether to shop in o ...
