Site Reliability Engineer M Level

Bangalore - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Job Description: Site Reliability Engineer (SRE)

Job Type: Full-Time – External Hire
Location: Any location in India
Experience: 9–12 years
Role Level: Engineer / Senior SRE consultant

Designation: M

Role Summary

We are looking for an experienced Site Reliability Engineer (SRE) with strong knowledge of SRE practices, monitoring & observability, AIOps, FinOps, and application maintenance operations. The ideal candidate will evaluate existing observability ecosystems, perform SRE maturity assessments, provide data-driven recommendations based on those assessments, and support the implementation of SRE and AIOps practices across the organization.
This role requires both analytical and hands-on engineering capabilities, with the ability to work across DevOps, Cloud, Architecture, and Support teams to enhance system reliability.

Key Responsibilities

1. Monitoring & Observability

Evaluate existing monitoring and observability tools (e.g., Dynatrace, New Relic, Datadog, Splunk, Grafana, Prometheus, AppDynamics, OpenTelemetry, ELK/EFK).
Identify gaps in visibility, alerting, instrumentation, log analytics, and distributed tracing.
Define standards for SLO-driven monitoring, alert thresholds, dashboards, and root-cause visibility.
Recommend tool improvements, integrations, or consolidations to improve reliability and efficiency.

2. AIOps Implementation & Adoption

Evaluate readiness and maturity for AIOps adoption.
Work with operational and support teams to leverage:
- Event correlation & noise reduction
- Automated anomaly detection
- Predictive analytics for incidents
- AI-driven RCA insights
Integrate AIOps tools with observability and ITSM systems.
Build automation pipelines for:
- Self-healing actions
- Auto-remediation
- Intelligent alert routing
Provide guidance on selecting or optimizing AIOps platforms (e.g., Moogsoft, BigPanda, Dynatrace Davis AI, Datadog AIOps).

3. FinOps for Reliability & Cost Optimization

Collaborate with Cloud & Finance teams to:
- Analyze cloud usage patterns (compute, storage, networking).
- Identify reliability-cost trade-offs (e.g., scaling vs. overprovisioning).
- Recommend cost-efficient architectures without compromising availability.
Implement FinOps practices such as:
- Rightsizing resources
- Reserved instances vs. on-demand usage
- Eliminating idle workloads
- Monitoring cost anomalies
Create cost dashboards integrating cloud billing with CI/CD and observability tools.
Partner with teams to ensure services meet both reliability and cost efficiency targets.

4. SRE Maturity Assessments

Conduct SRE capability and maturity assessments across application teams and operations.
Analyze system reliability metrics (SLOs, SLIs, error budgets) and organizational processes.
Provide actionable insights and roadmaps to improve SRE maturity, covering people, process, and technology perspectives.
Track maturity evolution and support continuous improvement.

5. Incident, Problem & Change Reliability

Work closely with application support/maintenance teams to:
- Reduce incident volume and Mean Time To Recovery (MTTR).
- Build automation to eliminate repetitive manual tasks (toil).
- Strengthen problem management through deep analytics.
Implement patterns for resilience engineering, failure mode analysis, and chaos testing (optional, depending on the organization).

6. SRE Implementation & Coaching

Support the rollout of SRE frameworks, operating models, and best practices.
Coach application teams on SRE concepts such as:
- Service Level Objectives (SLOs)
- Error Budgets
- Toil Elimination
- Release Reliability & Deployment Best Practices
- Observability-Driven Operations
Collaborate across DevOps, QA, Architecture, and Operations to embed SRE into the SDLC and support models.

7. Automation & Tooling

Develop automation scripts or small utilities using:
- Python, Shell, PowerShell, Go, or similar.
Support CI/CD pipeline integration with monitoring and reliability gates.
Implement auto-remediation or self-healing scripts where applicable.

8. Application Maintenance Support

Provide expert support for production systems, drawing on a background in L2/L3 maintenance.
Analyze logs, metrics, traces, and exceptions to troubleshoot issues effectively.
Review release changes for reliability risks and recommend improvements.

Required Skills & Experience

Technical Skills

Strong expertise in at least 2–3 monitoring/APM tools (any combination):
- Datadog, Dynatrace, New Relic, Splunk, Prometheus, Grafana, ELK, AppDynamics, etc.
Experience with AIOps platforms (Moogsoft, BigPanda, Dynatrace Davis AI, Datadog AIOps).
Working knowledge of FinOps, including cloud cost analysis and optimization.
Strong knowledge of:
- Cloud platforms (AWS, Azure, or GCP)
- Linux/Unix fundamentals
- Microservices & API ecosystems
- Containerization (Docker, Kubernetes)
- Automation scripting (Python / Shell / Go)
- CI/CD tooling (Jenkins, Azure DevOps, GitLab CI, etc.)
Good understanding of:
- Networking concepts
- Distributed systems
- Application performance management (APM)
- Logging & tracing frameworks (OpenTelemetry preferred)

SRE Principles

Practical experience with:
- SLOs / SLIs / Error Budgets
- Incident & problem management
- Toil identification and elimination
- Reliability automation
- Release quality gates

Domain Experience

Strong experience in Application Maintenance / Production Support environments (mandatory).
Exposure to Agile/DevOps delivery models.
Familiarity with ITIL concepts (preferred).

Soft Skills

Strong analytical and problem-solving skills.
Excellent communication and ability to collaborate with cross-functional teams.
Ability to work in a fast-paced environment with minimal supervision.
Proactive, with a mindset geared toward continuous improvement and a reliability culture.

Preferred Qualifications

Certifications (optional):
- Google SRE Foundations / Professional Cloud DevOps Engineer
- AWS/Azure/GCP Cloud Certifications
- ITIL Foundation
FinOps Certified Practitioner (preferred).
Experience with automation tools or platforms.
Experience conducting maturity assessments or framework implementations.
