Workplace Platforms Site Reliability Engineer (SRE) Lead Dallas

Goldman Sachs

Job Location:

Dallas, TX - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

Description

Team Overview

The Workplace Engineering organization is responsible for the reliability, resilience, and operational integrity of the firm's endpoint compute platforms and services, including:

Corporate-owned physical devices
Virtual and cloud-hosted desktops
Core endpoint services such as device lifecycle management, access and identity integration, profile and session services, and application delivery frameworks

The Endpoint Compute SRE function applies Site Reliability Engineering (SRE) principles to ensure these platforms and services are highly available, observable, scalable, and recoverable while meeting operational and regulatory expectations.

Role Summary

We are seeking an Endpoint Compute SRE Lead to own reliability engineering and operational excellence across endpoint compute platforms and their foundational services.

This role is focused on systems and services, not applications, and covers the reliability of:

Endpoint compute platforms (physical, virtual, and cloud desktops)
Device and desktop lifecycle services
Access and sign-in dependency platforms
Profile, policy, and session services
Application delivery and execution frameworks (packaging, deployment, and availability, not app functionality)

The successful candidate will define service-level objectives, observability strategies, failure models, and operational practices that ensure a predictable and resilient end-user compute experience at enterprise scale.

Job Responsibilities

Reliability Engineering Across Endpoint Services

Own end-to-end reliability of endpoint compute platforms and supporting services
Define service boundaries, dependencies, and critical paths from user sign-in through productive desktop use
Model failure modes and blast radius across lifecycle, access, and delivery services
Drive designs that support graceful degradation and fast recovery

Observability & Telemetry

Establish observability standards across endpoint compute services including:
- Enrollment and provisioning success rates
- Access and session establishment health
- Policy and profile delivery latency/failures
- Application delivery availability
Ensure telemetry enables:
- Fast incident detection
- Root cause analysis
- Proactive trend identification

SLOs, SLIs & Error Budgets

Define SLOs and SLIs for key endpoint services (e.g., sign-in success, provisioning time, policy convergence)
Implement error budget frameworks to guide change, security control rollout, and platform evolution
Use reliability signals to influence platform design and operational priorities

Incident, Problem & Resilience Management

Lead reliability aspects of incident response involving endpoint compute or services
Drive post-incident reviews focused on systemic corrections
Identify recurring failure patterns in:
- Lifecycle flows
- Access paths
- Policy or profile delivery
Sponsor and track permanent fixes, not workarounds

Operational Excellence & Automation

Define and maintain runbooks, playbooks, and escalation models for endpoint services
Drive automation to reduce:
- Manual remediation
- Repeat incidents
- Operational toil
Influence engineering designs to improve operability and debuggability

Risk & Governance Alignment

Partner with Technology Risk and Security teams to:
- Demonstrate reliability and recoverability controls
- Support operational risk and resilience assessments
- Provide audit-ready evidence for availability and incident management
Ensure reliability metrics support control effectiveness narratives

Leadership & Collaboration

Act as the reliability authority for endpoint compute and services
Partner closely with:
- Endpoint platform engineers
- Device management teams
- Security engineering and identity teams
Mentor engineers in applying SRE principles to workplace platforms
Communicate reliability posture clearly to leadership

Basic Qualifications

8 years in SRE, platform operations, reliability engineering, or workplace infrastructure roles
Strong experience operating endpoint compute platforms and core supporting services at enterprise scale
Proven ability to define and implement:
- Observability frameworks
- SLOs / SLIs
- Incident and problem management models
Strong systems thinking across lifecycle, access, and service dependencies
Excellent documentation and communication skills

Preferred Qualifications

Experience applying SRE concepts to end-user computing or digital workplace platforms
Deep understanding of:
- Device lifecycle and provisioning services
- Identity and access dependencies (availability-focused)
- Profile, policy, and session orchestration
Experience in regulated or high-assurance environments
Strong ability to influence architecture using data-driven reliability insights

What Success Looks Like

Endpoint compute and services have clear reliability targets
Lifecycle, access, and delivery failures are predictable, observable, and fast to remediate
Incidents are less frequent, shorter, and less impactful
Platforms are designed with operability and resilience built in
Leadership has confidence in desktop stability as a service
