Director of Production Engineering (Reliability Platform Engineering)

Job Location: Durham - UK

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

Toshiba Global Commerce Solutions is seeking a Director of Production Engineering (Reliability Platform Engineering) to lead the reliability backbone of our global POS cloud and middleware platform. This strategic role owns system availability, resilience, performance, observability, and release reliability across a distributed, mission-critical commerce ecosystem.

This leader will unify Site Reliability Engineering (SRE), Resilience & Performance Engineering, Observability, and AI-driven Reliability Automation into one cohesive function. As AI accelerates development velocity, verification and reliability become the core bottlenecks, making this role a cornerstone of our engineering organization.

You will partner closely with Architecture, Cloud Operations, Functional Quality Engineering, and Software Development to ensure predictable reliability, smooth releases, and dramatically fewer Sev-1/Sev-2 incidents.

Responsibilities

System Reliability & Uptime:

  • Define and enforce SLO/SLA frameworks, error budgets, and release criteria.
  • Lead availability, resilience, and performance strategy across all services.
  • Own MTTR, MTBF, incident prevention, and rollback strategies at scale.

Unified Reliability Engineering Organization:

  • Lead teams across SRE & L3 Engineering, Resilience & Performance Engineering, Observability & Telemetry, and AI Reliability Automation.
  • Build a culture focused on prevention over firefighting.

Architecture-Level Reliability:

  • Collaborate with Principal Engineers and Architects to define system guardrails, resilience patterns, and failure modes.
  • Ensure high-quality Production Readiness Reviews (PRRs) and architectural consistency.

Resilience & Performance Engineering:

  • Own chaos, failover, load, stress, and soak testing strategies.
  • Validate store-mode behavior, payment workflows, edge-device dependencies, and multi-service interactions.

Observability & Telemetry:

  • Ensure complete, accurate signal for logs, traces, metrics, and business health.
  • Partner with AI systems to build intelligent anomaly detection pipelines.

AI-Driven Release Reliability:

  • Integrate AI-based reliability scoring, resiliency prediction, automated gating, regression analysis, and incident pattern detection.
  • Define the path toward autonomous release reliability pipelines.

Cross-Org Leadership:

  • Partner with Software Development, Functional Quality Engineering, Cloud Operations, Architecture, and TPM/TPO teams.
  • Drive multi-team initiatives and ensure readiness across complex release trains.

Required Experience:

  • Bachelor's degree in Computer Science or Engineering, or 10-15 years of direct experience.
  • 10-15 years in SRE, Reliability Engineering, Production Engineering, Distributed Systems, and Performance/Resilience Engineering.
  • Proven ownership of uptime and system reliability in complex distributed architectures.
  • Expertise in distributed systems, cloud platforms (AKS, Kubernetes), observability stacks (OpenTelemetry, Grafana, App Insights, Datadog), performance tuning, fault tolerance, network fundamentals, DB/service scaling, and chaos testing.
  • Architectural Leadership: Experience designing resilience patterns (timeouts, retries, hedging, circuit breakers). Strong partnership with architects and senior engineers.
  • Operational Maturity: Led SRE/on-call organizations. Defined SLOs, SLIs, and error budgets at scale. Track record of driving an incident prevention culture.
  • Leadership & Communication: Builds strong engineering teams and hires top talent. Influential communicator with executives and cross-functional teams. Highly collaborative and low-ego.

Preferred Requirements

  • AI-driven anomaly detection, regression analysis, incident clustering, and reliability scoring.
  • Experience with retail POS, payments, edge devices, or store environments.
  • Hybrid cloud and edge architectures.
  • Leading reliability transformations and scaling engineering organizations (200-500).

Why This Role Matters

As AI accelerates development velocity, the bottleneck shifts from coding to verification, reliability, and release safety. This role ensures:
- Uptime becomes engineered, not reactive.
- Development and QA operate at AI-enabled speed.
- Our platform grows safely while delivering stability and performance.
- We match or surpass best-in-class tech organizations (Google, Amazon, Azure, Stripe).

You will build the production engineering foundation that powers our next decade of innovation.

Toshiba Global Commerce Solutions is a dynamic, billion-dollar global company based in Research Triangle Park, NC, providing retail store solutions to your favorite brands. Have you ever been in a hurry and made use of the self-checkout at Lowes Foods, earned fuel rewards at Kroger, or just paid for purchases at retailers such as Walmart, Michaels, Carrefour, The Gap, Calvin Klein, Boots, Cencosud, BJ's, or Costco? These are just a few examples of our in-store solutions and impressive customer base that made us the world's installed market share leader.

The nature of retail is changing quickly, so if you share our Together Commerce vision of a seamless, two-way, participatory shopping experience, let's get together to drive the new economy.

Toshiba Global Commerce Solutions Inc. offers a competitive salary and generous benefits package including the following:

  • Group health coverage (medical, dental & vision)
  • Employee Assistance Programs
  • Pre-tax spending accounts
  • 401(k) plan (with company match)
  • Company provided life insurance
  • Pet Insurance
  • Employee discounts
  • Generous paid holiday schedule, paid vacation & sick/personal days



EEO:

Toshiba Global Commerce Solutions is an equal opportunity/affirmative action employer that evaluates qualified applicants without regard to age, ancestry, color, religious creed, disability, marital status, medical condition, genetic information, military or veteran status, national origin, race, sex, gender, gender identity, gender expression, and sexual orientation, or any other protected factor. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.

Individuals who need a reasonable accommodation because of a disability for any part of the employment process should email to request an accommodation.

DIVERSITY, EQUITY & INCLUSION:

We at Toshiba Global Commerce Solutions firmly believe that our people are integral to the success of our customers. Furthermore, we're committed to Diversity, Equity, and Inclusion for all our people, as highlighted by our 5 Core Principles (Create Outreach, Foster Belonging, Unleash Opportunity, Diverse Cultural Engagement, and Culture of Transparency). We're passionate about our customers, the retail industry, and becoming a more responsible company as we help create a brighter future.


Required Experience:

Director

Key Skills

  • Go
  • Lean
  • Management Experience
  • React
  • Node.js
  • Operations Management
  • Project Management
  • Research & Development
  • Software Development
  • Team Management
  • GraphQL
  • Leadership Experience

About Company

RJ Young partners with organizations to provide innovative office & IT technology, helping them streamline operations, boost productivity, and security.