Engineering Manager – Kubernetes Platform (AI Distributed Compute)

Job Location:

Marshall County, WV - USA

Monthly Salary: Not Disclosed
Posted on: 8 hours ago
Vacancies: 1 Vacancy

Job Summary

Engineering Manager – Kubernetes Platform (AI / Distributed Compute)
Location: Dallas, TX (Hybrid)

Competitive base salary plus performance bonus
100% company-paid benefits


Overview

We are seeking an Engineering Manager to lead the development and evolution of a large-scale Kubernetes platform supporting compute-intensive workloads across distributed environments.

This role blends technical leadership with hands-on platform expertise, focusing on building highly reliable, high-performance infrastructure that supports advanced data processing, AI/ML workloads, and large-scale compute operations. You will play a key role in shaping the platform strategy, guiding architectural decisions, and driving continuous improvement across performance, scalability, and automation.

The ideal candidate is a strong technical leader with experience managing engineering teams while remaining close to system design and platform engineering challenges.


Key Responsibilities

Team Leadership & Technical Direction
Lead, mentor, and grow a team of engineers responsible for platform development and operations
Define technical direction, roadmap, and best practices across platform engineering initiatives
Provide hands-on guidance in system design, performance optimization, and infrastructure strategy

Platform Architecture & Performance
Design and evolve Kubernetes-based infrastructure supporting high-throughput distributed workloads
Optimize resource allocation, workload scheduling, and system performance across shared compute environments
Ensure platform scalability, reliability, and efficient utilization of compute resources

Automation & Reliability
Drive automation across infrastructure and platform operations using Infrastructure-as-Code and CI/CD practices
Establish and enhance observability, monitoring, and incident response processes
Define and track key performance and reliability metrics across large-scale environments

Cross-Functional Collaboration
Partner with engineering, data, and infrastructure teams to integrate storage, networking, and compute systems
Collaborate on system design decisions involving distributed storage, high-speed networking, and runtime environments
Engage with external partners and vendors to improve tooling and platform capabilities

Capacity Planning & Operations
Oversee platform health, capacity planning, and long-term scalability across distributed infrastructure
Ensure operational readiness for high-demand workloads and evolving system requirements


Required Experience

7 years of experience in platform engineering, infrastructure engineering, or SRE environments
2 years of experience leading or managing engineering teams
Strong experience operating Kubernetes in large-scale production environments
Experience supporting compute-intensive workloads (e.g. AI/ML, data processing, or distributed systems)
Deep understanding of Linux systems, networking fundamentals, and performance optimization
Experience working with shared, multi-tenant infrastructure environments
Hands-on experience with Infrastructure-as-Code tools (e.g. Terraform, Ansible)
Familiarity with observability and monitoring tools (e.g. Prometheus, Grafana, logging platforms)
Strong communication skills, with the ability to align technical execution with business objectives


Preferred Experience

Familiarity with workload orchestration or scheduling frameworks (e.g. Slurm or similar)
Experience with container runtimes such as containerd or CRI-O
Exposure to distributed storage systems or high-performance networking concepts
Contributions to open-source projects within the Kubernetes, infrastructure, or AI/ML ecosystems
