Sr Director of Software Engineering - AI Infrastructure Platform
Palo Alto, CA - USA
Job Summary
Your opportunity to make a real impact and shape the future of financial services is waiting for you. Let's push the boundaries of what's possible together.
As a Senior Director of Software Engineering at JPMorganChase within the firmwide AI Infrastructure Platform organization, you will lead multiple technical areas and manage the activities of multiple departments responsible for delivering a unified AI infrastructure layer across on-premises environments, public cloud, and emerging accelerated-compute vendors. You will collaborate across AI/ML engineering, infrastructure, security and controls, and vendor teams to ensure the firm remains at the forefront of AI platform capabilities, operational excellence, and industry best practices.
In this role, you will own training and experimentation on a Kubernetes-standardized platform. While a dedicated architecture function exists, you will act as an active design partner, guiding architectural trade-offs and ensuring designs translate into reliable, secure, and operable systems at enterprise scale.
Job responsibilities
- Lead multiple technology and platform implementations across departments to deliver firmwide AI infrastructure objectives, with a primary focus on training and experimentation platforms operating at enterprise scale.
- Own the design, delivery, and evolution of a Kubernetes-first training and experimentation platform, including Kubernetes-native support for batch and distributed training jobs, lifecycle management, retry semantics, and failure recovery patterns.
- Standardize AI developer workflows for experimentation, enabling self-service job submission, reusable templates and golden paths, reproducibility mechanisms, and consistent runtime behavior across hybrid deployment environments.
- Build and evolve platform APIs and automation, including Kubernetes controllers and operators where appropriate, to ensure the platform is safe, scalable, and easy to adopt across teams.
- Drive measurable improvements in GPU availability and utilization through reliability engineering, fleet readiness patterns, and accelerated capacity onboarding.
- Define and implement governance-based scheduling and placement strategies, including:
  - Multi-tenant GPU quotas and guardrails
  - Priority, admission control, and reservation patterns
  - Preemption policies
  - Fragmentation reduction and topology-aware placement (GPU type, MIG, and topology awareness)
- Embed enterprise-grade security, risk, and control requirements into platform defaults, including IAM and RBAC controls, secrets management, audit logging, policy enforcement, network segmentation, and controlled change management.
- Drive operational excellence by establishing SLIs and SLOs, managing error budgets, leading incident management practices, forecasting capacity, and delivering end-to-end platform observability across job lifecycles and GPU telemetry.
- Act as the primary interface with senior leaders, stakeholders, and executives, driving alignment and consensus across competing priorities and complex initiatives.
- Lead multiple engineering teams and managers, building a high-performing organization with strong engineering standards, scalable operating models, and a culture of accountability and continuous improvement.
- Champion the firm's culture of diversity, opportunity, inclusion, and respect.
Required qualifications, capabilities, and skills
- 15 years of engineering experience, including 8 years of senior engineering leadership experience with responsibility for managing managers.
- Demonstrated experience delivering platform products (beyond foundational infrastructure) with strong adoption, reliability, and operational maturity.
- Experience developing and leading large cross-functional engineering teams within highly matrixed and complex enterprise environments.
- Proven track record of leading complex initiatives supporting distributed system design, testing, and operational stability at scale.
- Deep hands-on expertise with Kubernetes-based platforms, including:
  - Multi-tenancy, RBAC, admission control, and network policy
  - Multi-cluster operations, upgrades, and cluster lifecycle management
  - Controllers, operators (CRDs), and platform API design patterns
- Experience supporting AI training and experimentation platforms, including:
  - PyTorch and distributed training concepts such as scaling, orchestration, and failure modes
  - Ray or similar frameworks for distributed experimentation execution
  - Familiarity with Slurm or equivalent HPC or batch schedulers and core concepts such as queues, fair-share, reservations, and preemption
- Understanding of modern AI inference stacks (for example, vLLM) and how serving constraints (latency, throughput, batching, KV cache behavior, and GPU memory limits) influence training and experimentation platform design.
- Strong understanding of GPU infrastructure fundamentals, including NVIDIA ecosystem capabilities, health and telemetry signals, and scheduling and placement constraints.
- Extensive practical experience with cloud-native technologies and hybrid infrastructure environments spanning on-premises and public cloud.
- Experience hiring, developing, coaching, and retaining high-performing engineering talent.
Preferred qualifications, capabilities, and skills
- Experience operating large-scale GPU fleets, including heterogeneous accelerator environments.
- Experience delivering hybrid AI platforms across on-premises infrastructure, public cloud, and specialized accelerated-compute vendors.
- Experience working at the code level within large-scale distributed systems.
This position is subject to Section 19 of the Federal Deposit Insurance Act. As such, an employment offer for this position is contingent on JPMorganChase's review of criminal conviction history, including pretrial diversions or program entries.
Required Experience:
Exec
About Company
JPMorganChase, one of the oldest financial institutions, offers innovative financial solutions to millions of consumers, small businesses and many of the world’s most prominent corporate, institutional and government clients under the J.P. Morgan and Chase brands. Our history spans ov ...