Senior Director, Cloud Platform Engineering

MX

Job Location:

Lehi, UT - USA

Monthly Salary: Not Disclosed
Posted on: 14 hours ago
Vacancies: 1 Vacancy

Job Summary

Life at MX

We are driven by our moral imperative to advance mankind - and it all starts with our people, product, and purpose. We always carry a deep sense of drive and passion with us. If you thrive in a challenging work environment surrounded by incredible team members who will help you grow, MX is the right place for you.

Come build with us and be part of an award-winning company that's helping create meaningful and lasting change in the financial industry.

About the Role

We are looking for a US-based Senior Director who will be a strategic, operational execution, and escalation owner for all site infrastructure and cloud platform services.

This role is personally accountable for production reliability and stability, including owning US time-zone incidents and Sev 0/1 events, leading cutovers, and directly representing site infrastructure and platforms to executive leadership during high-impact events. The expectation is that this leader stands at the front of the line during critical incidents and events such as migration and stabilization, makes real-time decisions, and clearly articulates risk, impact, and trade-offs to executives under pressure. This front-line ownership is intentional but transitional. A core measure of success for this role is building the systems, operating model, delegation structure, and a strong leadership bench such that sustained high-quality operations do not depend on the continuous personal presence of a single leader. The leader is expected to design for leverage: establishing clear ownership, developing managers/leaders, and embedding practices that scale reliability beyond individual heroics.

In parallel, they are expected to lead the full lifecycle of our infrastructure transformation, from data center exit and AWS migration through steady-state cloud operations and platform maturity. Success is measured not just by completing the migration but by leaving behind a durable operating model with clear delegation, on-call ownership, and predictable executive engagement. The ideal candidate will have personally led large-scale data center exits and cloud migrations, not just advised on or governed them.

What Success Looks Like in 12-24 Months

  • 100% exit from on-premise data centers with all targeted workloads successfully migrated to AWS and on-prem dependencies fully decommissioned.

  • A clear, stable post-migration operating model in place, with unambiguous ownership across teams.

  • 99.99% availability for platform and infrastructure services consistently, with active error budget management guiding operational and delivery decisions.

  • Reduction in Sev 0 and Sev 1 incidents, with measurable reduction in customer-impacting events and improved predictability of recovery.

  • Improved incident KPIs, including faster MTTR and reduced incident recurrence.

  • Declining operational toil through automation, standardization, and self-service platform capabilities.

  • Mature incident management practices, including blameless postmortems and systemic remediation of root causes.

  • A strong leadership bench providing resilient production coverage, confident incident leadership, and effective delegation.

  • Improved cost efficiency and visibility across cloud infrastructure post-migration through FinOps practices, capacity right-sizing, and platform standardization.

Job Duties

Lead Data Center Exit & AWS Migration

  • Personally own and execute the end-to-end data center exit and AWS migration, from discovery and planning through cutover, stabilization, and full decommissioning.

  • Define migration waves, readiness gates, and cutover plans with explicit transition into steady-state ownership, avoiding temporary or parallel operating models.

  • Own architectural decisions across AWS networking, compute, storage, security, and observability, ensuring designs are operable, supportable, and resilient post-migration.

  • Establish and own the post-migration operating model for cloud infrastructure and platforms, explicitly tied to outcomes:

    • Clearly defined SLIs, SLOs, and error budgets for all Tier-1 and Tier-2 services

    • Accountable owners for SLO attainment across SRE, platform, and product teams

    • On-call and escalation models that provide durable time-zone coverage

    • Ongoing cost efficiency and optimization

    • Incident response, change management, and release practices aligned to reliability targets

    • Post-migration roadmap for platforms

  • Hold teams accountable for post-migration reliability metrics, including:

    • SLO compliance and error budget burn

    • Sev 0 / Sev 1 incident frequency and customer impact

    • MTTR and incident recurrence rates

  • Ensure migration execution does not introduce long-term operational debt and that workloads transition cleanly into measured, observable, and well-owned cloud operations.

  • Lead physical and logical data center decommissioning only after post-migration SLOs are consistently met and incident KPIs have stabilized.

Build and Evolve the Cloud Platform

  • Own the vision, roadmap, and execution for the company's cloud platform, ensuring it supports both migration needs and long-term steady-state operations on AWS.

  • Own core platform capabilities and tooling strategy, such as Kubernetes (EKS), CI/CD pipelines, infrastructure-as-code, identity and access management, secrets management, observability, and disaster recovery.

  • Deliver self-service, opinionated platform services that improve developer productivity while meeting security and reliability standards.

  • Modernize legacy systems and architect for multi-tenant SaaS: enable secure and efficient scaling across tenants in AWS with attention to cost, compliance, and observability.

  • Drive platform standardization to reduce fragmentation, operational toil, and cognitive load for product engineering teams.

  • Partner closely with application and product engineering to ensure the platform accelerates delivery while maintaining reliability and compliance.

Incident Management, SRE & Operational Resilience Leadership

  • Own and evolve the end-to-end incident management lifecycle for infrastructure and platform services, grounded in SRE principles of reliability, learning, and automation.

  • Define and enforce SLIs, SLOs, and error budgets for platform and infrastructure services, using them to guide operational decisions, release risk, and incident response.

  • Operate on a clear severity framework (Sev 0/1/2) with explicit ownership, escalation paths, and decision rights.

  • Lead the transition from incident response as heroics to incident prevention by design, embedding reliability, AI capacity planning, and failure-mode analysis into platform roadmaps and change processes.

  • Serve as the executive escalation owner for Sev 0 and Sev 1 incidents, personally leading response, trade-off decisions, and executive communications when required, while delegating incident command to empowered leaders to ensure sustained coverage.

  • Hold clear decision authority under pressure, including the ability to unilaterally halt or roll back changes, trigger failovers/traffic shifts and disaster recovery actions, reallocate engineering resources in demanding situations, and make go/no-go cutover decisions to protect customers and data, escalating to executive leadership when actions materially impact regulatory posture, contractual commitments, or significant financial exposure.

  • Build and maintain a US-based SRE and incident leadership bench with multiple leaders capable of acting as Incident Commander, owning executive updates, and coordinating cross-functional response.

  • Lead through error budgets and reliability signals to drive blameless postmortems, root-cause analysis, and prioritization of systemic fixes over short-term feature velocity.

  • Own the systematic reduction of operational toil and capacity tax across infrastructure and platform teams, with clear accountability for ensuring reactive work declines as systems mature.

  • Hold teams accountable to measurable toil and resilience KPIs such as percentage of engineer time spent on reactive work, on-call interrupt frequency, manual intervention rates, and incident recurrence.

  • Influence readiness through game days, chaos testing, and migration-specific drills, validating both technical resilience and delegation models under pressure.

  • Ensure incident management tooling, observability (metrics, logs, traces), and documentation are standardized, well-owned, and continuously improved.

Program, Stakeholder, and Executive Leadership

  • Partner with product, engineering, security, enterprise architecture, and finance to shape cloud migration and platform decisions that directly impact cost-to-serve, unit economics, and operational overhead, ensuring infrastructure choices scale sustainably with business growth.

  • Drive architectural and platform standards that reduce total cost of ownership, including infrastructure spend, support burden, reliability overhead, and on-call load.

  • Embed FinOps and reliability signals (utilization, reliability, cost, incident-driven spend, operational toil) into platform roadmaps and migration sequencing, making trade-offs explicit between performance, resilience, speed, and cost.

  • Translate infrastructure and platform choices into clear business outcomes such as per-customer cost, per-transaction cost, and support effort, enabling executives to make informed investment and prioritization decisions.

  • Act as a trusted advisor on infrastructure and cloud strategy, challenging assumptions and translating complex technical risks into clear business impact, options, and trade-offs to enable informed decision-making under pressure.

  • Build and delegate clear ownership and accountability for cloud migration timelines, risks, and outcomes.

  • Establish clear governance, readiness reviews, and success metrics for migration and platform initiatives.

  • Partner with and guide steering committees, technical working groups, and cross-organizational readiness forums.

People and Organization Leadership

  • Own the design, scale, and effectiveness of the Cloud Platform Engineering organization, including SRE, cloud infrastructure, and platform engineering teams across geographies.

  • Build and lead a strong leadership bench, developing senior managers, principal engineers, and architects who can operate independently at scale.

  • Clearly define delegation, decision rights, and escalation paths so that critical incidents, migrations, and operational responsibilities are owned at the right level.

  • Drive organizational clarity across charters, roles, responsibilities, and decision rights to reduce friction and increase delivery velocity.

  • Actively recruit, retain, and develop top-tier infrastructure, SRE, and platform talent, including succession planning for critical roles.

  • Establish a culture of engineering excellence, reliability, and continuous improvement grounded in data, post-incident learning, and blameless accountability.

  • Lead change management during periods of transformation, including data center exit, cloud migration, and operating model shifts.

  • Foster strong partnerships with product, application engineering, security, and business leaders, ensuring platform teams are seen as strategic enablers and not service providers.

  • Champion diversity of thought, inclusive leadership, and high team engagement across a growing global organization.

Role Requirements

  • 15 years of experience in infrastructure, cloud, SRE, or platform engineering.

  • 7 years leading large engineering organizations (managers of managers or equivalent).

  • Direct, hands-on leadership of at least one full data center exit and AWS migration, including decommissioning of on-premise infrastructure.

  • Deep technical expertise in AWS, including VPC networking, EC2, EKS/Kubernetes, RDS/Aurora, S3, IAM, and observability tooling.

  • Strong experience operating highly available distributed systems using SRE principles.

  • Proven ability to lead complex, high-risk infrastructure transformations in production environments.

  • Expertise in FinOps and cloud cost optimization practices.

  • Demonstrated ability to drive standards and adoption across distributed engineering teams without relying on reporting lines.

  • Skilled at operating as a front-line executive leader during critical situations, including migrations, upgrades, DR, incidents, and major production events.

At MX, we are a high-performance organization that thrives on trust and results. This role is based in Lehi, Utah, with flexibility for both in-office and remote work. We believe in empowering our team members to deliver exceptional outcomes while taking advantage of our incredible office space when it best supports their work. Our Utah office features onsite perks such as company-paid meals, massage therapists, a sports simulator, gym, mother's lounge, and meditation room, and meaningful interactions with amazing people. We encourage team members to come together in the office to collaborate, kick off key projects, or strategize cross-functionally, fostering connection and innovation.

MX is proudly committed to recruiting and retaining a diverse and inclusive workforce. As an Equal Opportunity Employer, we never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, military or veteran status, status as an individual with a disability, or other applicable legally protected characteristics. We particularly welcome applications from veterans and military spouses. All your information will be kept confidential according to EEO guidelines. You may request reasonable accommodations by sending an email to


Required Experience:

Exec

Key Skills

  • Go
  • Lean
  • Management Experience
  • React
  • Node.js
  • Operations Management
  • Project Management
  • Research & Development
  • Software Development
  • Team Management
  • GraphQL
  • Leadership Experience