SRE

Trigent Software Private Limited

Job Location:

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Mandatory Skills
Python, Site Reliability Engineer, ELK
Skills to Evaluate
Python, Site Reliability Engineer, ELK, AWS, GCP, Kubernetes, Docker, Ansible, Packer, Jenkins, Splunk, Cribl, Terraform, Vectors, Prometheus, Linux, Helm, Datadog

Job Description
We are looking for a Senior Site Reliability Engineer (SRE) with deep expertise in observability, cloud-native infrastructure, and large-scale distributed systems. This role is highly hands-on and focuses on designing, building, and operating reliable, observable, and scalable platforms running on Kubernetes, with a strong preference for Google Cloud Platform (GCP) and AWS.

Roles & Responsibilities
Reliability & Operations

Design, implement, and maintain highly available and resilient systems in Kubernetes-based environments
Define and enforce SLOs, SLIs, and error budgets
Lead incident response, RCA, and postmortems
Drive reliability improvements through automation

Observability (Core Focus)

Architect and operate observability platforms for metrics, logging, tracing, and alerting
Work with Prometheus, Alertmanager, OpenTelemetry, Grafana, and Loki / ELK / OpenSearch
Implement cloud-native monitoring (GCP Cloud Monitoring & Logging preferred)
Establish actionable alerting standards

Cloud & Platform Engineering

Build and manage infrastructure on GCP (preferred) or AWS
Operate Kubernetes clusters (GKE preferred)
Deploy services using Helm
Manage containerized workloads using Docker

Automation & Tooling

Strong Python skills with emphasis on reliability, automation, and observability tooling
Develop automation and tooling using Python
Create internal reliability and monitoring tools
Integrate CI/CD pipelines with observability and reliability checks

Collaboration & Leadership