Engineering Manager, HPC Kubernetes Platform

Job Location:

Marshall County, WV - USA

Monthly Salary: Not Disclosed
Posted on: 16 hours ago
Vacancies: 1

Job Summary

Engineering Manager, HPC Kubernetes Platform
Location: Dallas, TX

Overview

This organization is backed by dedicated leadership and investment, with a clear mission, and operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

We are seeking an experienced Engineering Manager, HPC Kubernetes Platform, to lead the team responsible for designing and scaling a bare-metal Kubernetes environment, the orchestration layer powering GPU- and CPU-intensive machine-learning and HPC workloads across global datacenters.

This is a hands-on leadership role focused on platform performance, reliability, and automation. You will define the technical roadmap, guide system architecture and optimization, and ensure the Kubernetes platform delivers top-tier reliability and throughput for distributed ML and HPC environments. The ideal candidate is a strong technical leader who thrives at the intersection of infrastructure engineering, AI systems, and high-performance computing.

Key Responsibilities

- Lead and mentor engineers designing and scaling a bare-metal Kubernetes platform for HPC and ML workloads.
- Architect and optimize GPU/CPU scheduling, resource management, and performance across multi-tenant compute clusters.
- Drive automation and observability using Infrastructure-as-Code, CI/CD, and SRE best practices.
- Collaborate with Research, Storage, and Network teams to integrate distributed filesystems, high-speed interconnects (InfiniBand, RoCE), and custom runtimes.
- Partner with hardware and software vendors to improve tooling, influence product roadmaps, and streamline deployment.
- Oversee platform reliability, capacity forecasting, and performance KPIs across thousands of nodes.

Required Experience

- 7 years in infrastructure, platform, or SRE engineering, including 2 in technical leadership.
- Proven experience operating Kubernetes environments tailored for HPC or ML training workloads, including GPU scheduling, resource isolation, and workload optimization.
- Deep knowledge of Linux systems, networking, and performance engineering on bare-metal hardware.
- Experience managing large-scale, multi-tenant clusters and integrating distributed storage or high-speed networking.
- Strong automation experience (Terraform, Ansible, or similar) and familiarity with observability tools (Prometheus, Grafana, Loki).
- Excellent communication and stakeholder management skills; ability to translate complex technical direction into clear, actionable plans.
- Bachelor's degree or equivalent experience.

Preferred Experience

- Familiarity with HPC schedulers (Slurm, Flux) and container runtimes (containerd, CRI-O).
- Contributions to open-source Kubernetes or ML infrastructure projects.
