Job Summary
Mandatory Skills
Python, Site Reliability Engineer, ELK
Skill to Evaluate
Python, Site Reliability Engineer, ELK, AWS, GCP, Kubernetes, Docker, Ansible, Packer, Jenkins, Splunk, Cribl, Terraform, Vector, Prometheus, Linux, Helm, Datadog
Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a strong preference for Google Cloud Platform (GCP) and AWS.
Roles & Responsibilities
Reliability & Operations
- Design, implement, and maintain highly available and resilient systems in Kubernetes-based environments
- Define and enforce SLOs, SLIs, and error budgets
- Lead incident response, root cause analysis (RCA), and postmortems
- Drive reliability improvements through automation
Observability (Core Focus)
- Architect and operate observability platforms for metrics, logging, tracing, and alerting
- Work with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch
- Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
- Establish actionable alerting standards
Cloud & Platform Engineering
- Build and manage infrastructure on GCP (preferred) or AWS
- Operate Kubernetes clusters (GKE preferred)
- Deploy services using Helm
- Manage containerized workloads using Docker
Automation & Tooling
- Strong Python skills with emphasis on reliability, automation, and observability tooling
- Develop automation and tooling using Python
- Create internal reliability and monitoring tools
- Integrate CI/CD pipelines with observability and reliability checks
Collaboration & Leadership
- Mentor junior engineers
- Influence architecture decisions
- Collaborate across engineering teams