Site Reliability Engineer M Level


Job Location: Bangalore - India

Monthly Salary: Not Disclosed
Posted on: 8 days ago
Vacancies: 1

Job Summary

Job Description: Site Reliability Engineer (SRE)

Job Type: Full-Time – External Hire
Location: Any location in India
Experience: 9–12 years
Role Level: Engineer / Senior SRE Consultant

Designation: M

Role Summary

We are looking for an experienced Site Reliability Engineer (SRE) with strong knowledge of SRE practices, monitoring and observability, AIOps, FinOps, and application maintenance operations. The ideal candidate will evaluate existing observability ecosystems, perform SRE maturity assessments, provide data-driven recommendations based on those assessments, and support the implementation of SRE and AIOps practices across the organization.
This role requires both analytical and hands-on engineering capabilities, with the ability to work across DevOps, Cloud, Architecture, and Support teams to enhance system reliability.

Key Responsibilities

1. Monitoring & Observability

  • Evaluate existing monitoring and observability tools (e.g. Dynatrace, New Relic, Datadog, Splunk, Grafana, Prometheus, AppDynamics, OpenTelemetry, ELK/EFK, etc.).
  • Identify gaps in visibility, alerting, instrumentation, log analytics, and distributed tracing.
  • Define standards for SLO-driven monitoring, alert thresholds, dashboards, and root-cause visibility.
  • Recommend tool improvements, integrations, or consolidations to improve reliability and efficiency.

2. AIOps Implementation & Adoption

  • Evaluate readiness and maturity for AIOps adoption.
  • Work with operational and support teams to leverage:
    • Event correlation & noise reduction
    • Automated anomaly detection
    • Predictive analytics for incidents
    • AI-driven RCA insights
  • Integrate AIOps tools with observability and ITSM systems.
  • Build automation pipelines for:
    • Self-healing actions
    • Auto-remediation
    • Intelligent alert routing
  • Provide guidance on selecting or optimizing AIOps platforms (e.g. Moogsoft, BigPanda, Dynatrace Davis AI, Datadog AIOps).

3. FinOps for Reliability & Cost Optimization

  • Collaborate with Cloud & Finance teams to:
    • Analyze cloud usage patterns (compute, storage, networking).
    • Identify reliability-cost trade-offs (e.g. scaling vs. overprovisioning).
    • Recommend cost-efficient architectures without compromising availability.
  • Implement FinOps practices such as:
    • Rightsizing resources
    • Reserved instances vs. on-demand usage
    • Eliminating idle workloads
    • Monitoring cost anomalies
  • Create cost dashboards integrating cloud billing with CI/CD and observability tools.
  • Partner with teams to ensure services meet both reliability and cost efficiency targets.

4. SRE Maturity Assessments

  • Conduct SRE capability and maturity assessments across application teams and operations.
  • Analyze system reliability metrics (SLOs, SLIs, error budgets) and organizational processes.
  • Provide actionable insights and roadmaps to improve SRE maturity, covering people, process, and technology perspectives.
  • Track maturity evolution and support continuous improvement.

5. Incident, Problem & Change Reliability

  • Work closely with application support/maintenance teams to:
    • Reduce incident volume and Mean Time To Recovery (MTTR).
    • Build automation to eliminate repetitive manual tasks (toil).
    • Strengthen problem management through deep analytics.
  • Implement patterns for resilience engineering, failure mode analysis, and chaos testing (optional, depending on the organization).

6. SRE Implementation & Coaching

  • Support the rollout of SRE frameworks, operating models, and best practices.
  • Coach application teams on SRE concepts such as:
    • Service Level Objectives (SLOs)
    • Error Budgets
    • Toil Elimination
    • Release Reliability & Deployment Best Practices
    • Observability-Driven Operations
  • Collaborate across DevOps, QA, Architecture, and Operations to embed SRE into the SDLC and support models.

7. Automation & Tooling

  • Develop automation scripts or small utilities using:
    • Python, Shell, PowerShell, Go, or similar.
  • Support CI/CD pipeline integration with monitoring and reliability gates.
  • Implement auto-remediation or self-healing scripts where applicable.

8. Application Maintenance Support

  • Provide expert support for production systems, drawing on a background in L2/L3 maintenance.
  • Analyze logs, metrics, traces, and exceptions to troubleshoot issues effectively.
  • Review release changes for reliability risks and recommend improvements.

Required Skills & Experience

Technical Skills

  • Strong expertise in at least 2–3 monitoring/APM tools (any combination):
    • Datadog, Dynatrace, New Relic, Splunk, Prometheus, Grafana, ELK, AppDynamics, etc.
  • Experience with AIOps platforms (Moogsoft, BigPanda, Dynatrace Davis AI, Datadog AIOps).
  • Working knowledge of FinOps including cloud cost analysis and optimization.
  • Strong knowledge of:
    • Cloud platforms (AWS, Azure, or GCP)
    • Linux/Unix fundamentals
    • Microservices & API ecosystems
    • Containerization (Docker, Kubernetes)
    • Automation scripting (Python / Shell / Go)
    • CI/CD tooling (Jenkins, Azure DevOps, GitLab CI, etc.)
  • Good understanding of:
    • Networking concepts
    • Distributed systems
    • Application performance management (APM)
    • Logging & tracing frameworks (OpenTelemetry preferred)

SRE Principles

  • Practical experience with:
    • SLOs / SLIs / Error Budgets
    • Incident & problem management
    • Toil identification and elimination
    • Reliability automation
    • Release quality gates

Domain Experience

  • Strong experience in Application Maintenance / Production Support environments (mandatory).
  • Exposure to Agile/DevOps delivery models.
  • Familiarity with ITIL concepts (preferred).

Soft Skills

  • Strong analytical and problem-solving skills.
  • Excellent communication and ability to collaborate with cross-functional teams.
  • Ability to work in a fast-paced environment with minimal supervision.
  • Proactive, with a mindset for continuous improvement and a reliability culture.

Preferred Qualifications

  • Certifications (optional):
    • Google SRE Foundations / Professional Cloud DevOps Engineer
    • AWS/Azure/GCP Cloud Certifications
    • ITIL Foundation
  • FinOps Certified Practitioner (preferred).
  • Experience with automation tools or platforms.
  • Experience conducting maturity assessments or framework implementations.


Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root Cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting