Asset & Wealth Management - AM Position Data Technology - Vice President - Software Engineering - Bengaluru

Bengaluru - India

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Description

Who We Are

At Goldman Sachs, we connect people, capital, and ideas to help solve problems for our clients. We are a leading global financial services firm providing investment banking, securities, and investment management services to a substantial and diversified client base that includes corporations, financial institutions, governments, and individuals.

Site Reliability Engineer (SRE): Incident Management, Escalation, and Automation

Overview

We are seeking a seasoned Site Reliability Engineer who excels at incident response and management, with a strong emphasis on escalation discipline and crisp, audience-appropriate communications.

You will partner closely with front-office trading desks, engineering and fellow SRE colleagues, and Application Business Operations (ABO) to enhance desk readiness, reduce manual workload through strategic automation and AI, and raise the bar on observability, capacity, and change quality across globally distributed systems. This role includes stewardship of cross-region handoffs, governance of error budgets, and the establishment of clear SRE KPIs to demonstrate value and drive continuous improvement.

Key Responsibilities

Incident Command, Escalation, and Communications
Act as Incident Commander for high-severity events, ensuring timely escalation, resolver engagement, and transparent communications to technical and business stakeholders.
Maintain consistent status updates, incident timelines, and customer/leadership communications; improve comms templates and runbooks for clarity and speed.
Drive post-incident reviews with a blameless, learning-first approach; produce actionable remediation items, owners, and due dates.
Cross-Region Handoffs and Desk Readiness
Own the cross-region handoff procedure to ensure emerging issues are surfaced globally with explicit ownership, clear next steps, and desk-readiness checklists.
Ensure shift notes, incident context, and risk hot-spots are consistently captured, discoverable, and actioned.
ABO Partnership and Workload Reduction
Partner closely with ABO to identify incident/issue trends and patterns; quantify impact and prioritize engineering fixes that remove manual workarounds.
Provide visibility into ABO workload; escalate when prioritization is needed for engineering solutions that reduce toil.
Strategic Automation and AI
Apply engineering tenets to automate repetitive tasks, codify remediations, and implement self-healing mechanisms; evaluate and responsibly adopt AI to improve triage, runbook execution, and anomaly detection.
Track toil reduction and time saved; feed back into prioritization and capacity planning.
Observability, Monitoring, and Alert Quality
Collaborate with developers to improve instrumentation, SLIs, dashboards, and actionable alerts aligned to firmwide standards and globally consistent tooling.
Reduce alert noise and increase the signal-to-noise ratio via better thresholds, aggregation, deduplication, and suppression; validate alert-to-action mapping with runbooks and ownership.
Expand tracing, logging, and metrics coverage to speed detection, triage, and root cause isolation.
SLOs, Error Budgets, and Reliability Governance
Define and steward SLOs and SLIs across services; implement and manage error budgets with clear policies influencing release velocity and risk acceptance.
Facilitate data-driven tradeoffs between feature delivery and reliability; regularly review budget burn with product and engineering.
Capacity Engineering and Scalability
Drive capacity engineering standards; partner with teams on forecasting, scaling strategies, and reporting (leading indicators, saturation, headroom).
Work with developers to automate capacity tests, limit management, and scaling actions; ensure predictable behavior under load and graceful degradation.
Change Quality and ORR Gatekeeping
Oversee change quality across environments; reduce change-related incidents through pre-deployment checks, progressive delivery, and canaries.
Serve as ORR (Operational Readiness Review) gatekeeper to validate observability, runbooks, on-call readiness, rollback plans, and dependencies before go-live.
Documentation, Runbooks, and Training
Review and improve documentation freshness, clarity, and completeness; identify and automate runbook steps with high repeatability.
Train developers on SRE fundamentals: SLOs/SLIs, error budgets, incident roles, on-call hygiene, and production-readiness best practices.
KPIs and Reporting
Establish, track, and publish SRE KPIs and OKRs to evidence value, including MTTD, MTTA, MTTR, incident frequency and severity distribution, change failure rate, error budget burn, alert quality, toil reduction, and capacity headroom.
Produce regular executive-ready reports and partner dashboards; highlight trends, risks, and the impact of reliability investments.

Qualifications

8 years in SRE, production operations, or reliability-focused engineering supporting high-availability customer-facing or trading/front-office systems.
Proven experience as Incident Commander with measurable improvements in escalation timeliness, communications quality, and MTTR.
Strong foundations in Linux, networking (DNS, HTTP, TLS, routing), distributed systems, and public cloud (AWS/Azure/GCP).
Hands-on with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry, ELK), incident tooling (e.g., PagerDuty, Opsgenie), and collaboration platforms (e.g., Slack/Teams).
Proficiency with infrastructure-as-code and automation (e.g., Terraform, CloudFormation, Ansible) and at least one modern programming language (Go, Python).
Experience implementing SLO/SLI/error budgets, capacity planning, progressive delivery (feature flags, canary, blue/green), and chaos/game days.
Excellent written and verbal communication; able to translate complex technical contexts into concise updates for executives and business stakeholders.
Comfortable working across time zones with strong ownership of cross-region handoffs and follow-through.

Preferred Experience

Front-office/trading or similarly latency- and availability-sensitive environments; close partnership with business operations (ABO) or site operations teams.
Kubernetes-based microservices, service meshes, multi-region architectures, and global standards harmonization.
Building AI-assisted operations (alert enrichment, anomaly detection, runbook copilots) with measurable toil reduction.
Operating status pages and customer-facing incident communications.
Implementing ITIL-aligned processes adapted to SRE practices; ORR frameworks and governance.

Success Metrics

Faster detection and resolution: lower MTTD, MTTA, and MTTR for incidents.
Higher alert quality: reduced volume, higher precision, clear actionability.
Reduced change failure rate; increased success of progressive rollouts.
Measurable toil reduction for ABO and engineering through automation and AI.
Improved capacity predictability through documented headroom and fewer saturation events.
Documentation freshness and runbook automation coverage.
Positive stakeholder feedback on handoffs, communications, and incident leadership.

Goldman Sachs Engineering Culture

At Goldman Sachs, our Engineers don't just make things; we make things possible. Change the world by connecting people and capital with ideas. Solve the most challenging and pressing engineering problems for our clients. Join our engineering teams that build massively scalable software and systems, architect low latency infrastructure solutions, proactively guard against cyber threats, and leverage machine learning alongside financial engineering to continuously turn data into action. Create new businesses, transform finance, and explore a world of opportunity at the speed of markets.

Engineering is at the critical center of our business, and our dynamic environment requires innovative strategic thinking and immediate, real solutions. Want to push the limit of digital possibilities? Start here!

Goldman Sachs is an equal employment/affirmative action employer: Female/Minority/Disability/Veteran/Sexual Orientation/Gender Identity.

Required Experience:

Exec

About Company

Goldman Sachs

The Goldman Sachs Group, Inc. is a leading global investment banking, securities, and asset and wealth management firm that provides a wide range of financial services.
