Platform Reliability & Observability Lead (SRE)

Fortrea

Job Location:

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Job Overview:

The Platform Reliability & Observability Lead (SRE) will own and elevate the reliability, availability, and operational excellence of Fortrea's hosting and platform services. This is an engineering-led role accountable for measurable reliability outcomes across cloud and hybrid environments supporting regulated clinical workloads. The role leads observability strategy, SLO and error budget programs, incident automation, and root cause engineering, ensuring platforms are resilient, predictable, compliant, and scalable. This position is critical to enabling Operational Excellence, Embedded Quality, Financial Discipline, and Customer Trust.

Summary of Responsibilities:

Engineer reliability into hosting and platform services through design reviews, resilience patterns, and readiness assessments.
Define and enforce standards for availability, latency, durability, recoverability, and scalability.
Own end-to-end observability strategy, including metrics, logs, traces, alerting, dashboards, and service health reporting.
Establish and operationalize SLIs, SLOs, and error budgets to guide prioritization, release readiness, and risk decisions.
Design and automate incident detection, triage, mitigation, rollback, and diagnostics to improve MTTD and MTTR.
Lead blameless post-incident reviews, identify systemic issues, and drive remediation to closure.
Reduce operational toil through automation, engineering rigor, and self-service tooling.
Partner with cloud, hosting, IaC, and application teams to embed reliability into the SDLC.

Qualifications (Minimum Required):

Bachelor's degree in Computer Science, Computer Engineering, or a related field.
Excellent communication and public speaking skills, with the ability to present complex architectural concepts to senior leadership, technical teams, and non-technical stakeholders.
Fortrea may consider relevant and equivalent experience in lieu of educational requirements.

Required skills (Minimum Required):

9 years in Site Reliability Engineering, Platform Engineering, or Production Engineering.
Proven ownership of production reliability in cloud or hybrid platforms.
Strong foundations in distributed systems, Linux, networking, and system internals.
Hands-on experience with observability architectures and alerting best practices.
Strong expertise in SLIs, SLOs, SLAs, and error budgets.
Proficiency in Python, Go, Java, or equivalent, with a strong automation mindset.
Experience with Azure (preferred), AWS, or GCP.
Experience with Kubernetes and Infrastructure as Code (Terraform, Bicep, ARM, etc.).

Preferred Qualifications Include:

Regulated or GxP environments.
OpenTelemetry, distributed tracing, and service dependency mapping.
Chaos engineering, DR testing, or resilience validation.
FinOps and cost-aware reliability engineering.
Building shared reliability or observability platforms.

Physical Demands / Work Environment:

Remote-Based as requested by the line manager
Work Timings: 2:00 PM IST to 11:00 PM IST

