Engineering Manager, Kubernetes Platform (AI / Distributed Compute)
Location: Dallas, TX (Hybrid)
Competitive base salary and performance bonus
100% company-paid benefits
Overview
We are seeking an Engineering Manager to lead the development and evolution of a large-scale Kubernetes platform supporting compute-intensive workloads across distributed environments.
This role blends technical leadership with hands-on platform expertise, focusing on building highly reliable, high-performance infrastructure that supports advanced data processing, AI/ML workloads, and large-scale compute operations. You will play a key role in shaping the platform strategy, guiding architectural decisions, and driving continuous improvement across performance, scalability, and automation.
The ideal candidate is a strong technical leader with experience managing engineering teams while remaining close to system design and platform engineering challenges.
Key Responsibilities
Team Leadership & Technical Direction
Lead, mentor, and grow a team of engineers responsible for platform development and operations
Define technical direction, roadmap, and best practices across platform engineering initiatives
Provide hands-on guidance in system design, performance optimization, and infrastructure strategy
Platform Architecture & Performance
Design and evolve Kubernetes-based infrastructure supporting high-throughput distributed workloads
Optimize resource allocation, workload scheduling, and system performance across shared compute environments
Ensure platform scalability, reliability, and efficient utilization of compute resources
Automation & Reliability
Drive automation across infrastructure and platform operations using Infrastructure-as-Code and CI/CD practices
Establish and enhance observability, monitoring, and incident response processes
Define and track key performance and reliability metrics across large-scale environments
Cross-Functional Collaboration
Partner with engineering, data, and infrastructure teams to integrate storage, networking, and compute systems
Collaborate on system design decisions involving distributed storage, high-speed networking, and runtime environments
Engage with external partners and vendors to improve tooling and platform capabilities
Capacity Planning & Operations
Oversee platform health, capacity planning, and long-term scalability across distributed infrastructure
Ensure operational readiness for high-demand workloads and evolving system requirements
Required Experience
7 years of experience in platform engineering, infrastructure engineering, or SRE environments
2 years of experience leading or managing engineering teams
Strong experience operating Kubernetes in large-scale production environments
Experience supporting compute-intensive workloads (e.g. AI/ML, data processing, or distributed systems)
Deep understanding of Linux systems, networking fundamentals, and performance optimization
Experience working with shared, multi-tenant infrastructure environments
Hands-on experience with Infrastructure-as-Code tools (e.g. Terraform, Ansible)
Familiarity with observability and monitoring tools (e.g. Prometheus, Grafana, logging platforms)
Strong communication skills with the ability to align technical execution with business objectives
Preferred Experience
Familiarity with workload orchestration or scheduling frameworks (e.g. Slurm or similar)
Experience with container runtimes such as containerd or CRI-O
Exposure to distributed storage systems or high-performance networking concepts
Contributions to open-source projects within Kubernetes, infrastructure, or AI/ML ecosystems