Senior Site Reliability Engineer

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Why Lytx Senior Site Reliability Engineer

At Lytx, our engineering culture is built around being hungry, low-ego, and highly capable. We are pragmatic engineers who take ownership, collaborate openly, and focus on delivering measurable operational impact. Our mission is to design, operate, and continuously improve the cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services at scale.

As our systems grow in scale and complexity, we are investing in modern observability, intelligent automation, and data-driven operations to improve reliability, reduce operational noise, and enable faster detection and recovery.

The Site Reliability Engineering (SRE) team is responsible for the availability, reliability, observability, and resilience of our cloud-native environments. This includes building automation, improving operational visibility, and partnering with engineering teams to ensure services are designed and operated for reliability and scale.

As a Senior SRE, you will lead reliability improvements for critical services and platforms, contribute to observability and automation initiatives, and help drive operational excellence through proactive engineering and continuous improvement.

If you enjoy solving complex production challenges, improving system insight, and building automation that makes operations more efficient and reliable, this role is a great fit.

Responsibilities / You'll Get To

Service & Platform Reliability Ownership - Own the reliability, performance, and operational health of critical services and infrastructure components, ensuring systems meet availability and performance expectations.

Observability Implementation - Design and implement monitoring, logging, tracing, and alerting to improve system visibility and ensure high-signal, low-noise operational insights.

Operational Automation - Build automation and tooling to reduce manual operational work, including runbook automation, self-healing workflows, and operational scripting.

Incident Response & Resolution - Lead response for high-severity incidents within your domain, participate in on-call rotations, and drive timely resolution to restore service.

Postmortems & Continuous Improvement - Conduct blameless postmortems, identify root causes, and implement corrective actions that prevent recurrence and improve system resilience.

Capacity & Performance Management - Analyze system performance and usage trends to support capacity planning, scaling strategies, and cost-efficient resource utilization.

Cloud & Infrastructure Engineering - Design, deploy, and operate scalable infrastructure in AWS using Infrastructure-as-Code and cloud-native best practices.

Cross-Functional Collaboration - Partner with product, platform, and development teams to embed reliability, observability, and performance best practices into system design and delivery.

AIOps & Operational Intelligence (Exposure) - Contribute to initiatives that improve operational signal quality, such as alert tuning, event correlation, anomaly detection, or automated remediation.

Team Contribution - Share operational knowledge and contribute to a culture of ownership, learning, and operational excellence.

Requirements / You'll Need

Experience

4 - 6 years of experience in SRE, DevOps, platform engineering, or cloud infrastructure roles supporting production environments.
Experience operating and supporting 24/7 systems, including participation in on-call rotations and incident response.

Cloud & Infrastructure

Hands-on experience designing and operating production workloads in AWS, including services such as EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, VPC, IAM, and CloudWatch.
Experience building infrastructure using Terraform, CloudFormation, or similar Infrastructure-as-Code tools.

Observability

Experience implementing monitoring and alerting using tools such as New Relic, Datadog, Prometheus/Grafana, CloudWatch, or similar.
Exposure to OpenTelemetry or modern telemetry standards.
Ability to improve alert quality, dashboards, and operational visibility.
Experience with alert noise reduction, anomaly detection, or other data-driven operational improvements.

Automation & Scripting

Strong scripting or programming skills (Python, Go, Bash, or similar) for operational automation and tooling.

Systems Knowledge

Solid understanding of Linux systems, networking fundamentals (TCP/IP, DNS, TLS), and distributed system behavior.
Experience with Kubernetes and cloud-native architectures preferred.

Operational Excellence

Familiarity with AI/ML-assisted operational tooling or AIOps concepts.
Experience performing root cause analysis and driving reliability improvements.
Ability to troubleshoot complex production issues under pressure.

Collaboration & Communication

Strong collaboration skills with the ability to work across engineering teams.
Ability to influence reliability improvements within your domain through technical leadership and clear communication.

Innovation Lives Here

Together, we help save lives on our roadways.

Find out how good it feels to be a part of an inclusive, collaborative team. We're committed to delivering an environment where everyone feels valued, included, and supported to do their best work and share their voices.

Lytx, Inc. is proud to be an equal opportunity/affirmative action employer and maintains a drug-free workplace. We're committed to attracting, retaining, and maximizing the performance of a diverse and inclusive workforce. EOE/M/F/Disabled/Vet.

Required Experience:

Senior IC
