Senior Kubernetes Engineer (GPU / AI Platforms)

Job Location: Marshall County, WV - USA

Monthly Salary: Not Disclosed
Posted on: 16 hours ago
Vacancies: 1

Job Summary

Senior Kubernetes Engineer (GPU / AI Platforms)
Location: Dallas, TX (Hybrid)
Type: Direct Hire

Competitive base salary plus performance bonus
100% company-paid benefits

Overview

We are seeking a Senior Kubernetes Engineer to design, implement, and optimize GPU-accelerated container platforms at scale within a high-performance computing environment.

This role focuses on enabling AI/ML, HPC, and large-scale training workloads across hybrid and on-prem infrastructure. The position requires deep expertise across both the Kubernetes and NVIDIA ecosystems, with a strong emphasis on GPU scheduling, performance optimization, and platform automation.

The ideal candidate brings hands-on experience building production-grade Kubernetes platforms for GPU-intensive workloads, along with strong development skills and a passion for scalable, high-performance infrastructure.

Key Responsibilities

Kubernetes Platform Engineering

Architect and operate Kubernetes clusters optimized for GPU workloads
Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster performance and observability
Ensure platform scalability, reliability, and performance for high-throughput workloads

GPU Enablement & Scheduling

Integrate NVIDIA device plugins, Multi-Instance GPU (MIG), and GPU sharing capabilities into Kubernetes scheduling
Optimize GPU utilization and workload placement using scheduler extensions such as kube-scheduler plugins, Slurm, and Volcano
Support GPU-intensive workloads including LLM training, AI/ML pipelines, and scientific computing

Automation & Operator Development

Develop, deploy, and maintain custom Kubernetes operators and controllers
Automate infrastructure services and platform operations using Go or Python
Contribute to Infrastructure-as-Code practices using Terraform, Helm, and Kustomize

Observability & Performance

Implement monitoring and telemetry solutions using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
Drive performance tuning and capacity optimization across GPU-enabled clusters
Participate in incident response and production readiness reviews

Security & Multi-Tenancy

Implement secure multi-tenant environments with RBAC and policy enforcement (OPA Gatekeeper)
Ensure proper isolation across users, namespaces, and workloads

DevOps & CI/CD

Maintain and enhance CI/CD pipelines using GitOps tools such as ArgoCD and FluxCD
Support continuous deployment and lifecycle management of Kubernetes infrastructure

Cross-Functional Collaboration

Partner with HPC, ML, DevOps, and platform engineering teams to support high-performance workloads
Collaborate on infrastructure design, optimization, and operational best practices

Required Experience

Extensive experience operating Kubernetes in production-grade environments
Deep expertise with the NVIDIA and Kubernetes ecosystems, including GPU Operator, device plugins, NVML, MIG, and DCGM
Strong understanding of Kubernetes internals, including CRDs, RBAC, custom controllers, and scheduler extensions
Proficiency in Go or Python for operator development and automation
Experience supporting GPU-intensive workloads such as LLM training, AI/ML pipelines, or HPC workloads
Hands-on experience with Helm, Kustomize, and GitOps workflows

Technical Skills

Experience with Prometheus, Grafana, DCGM Exporter, and OpenTelemetry for monitoring and observability
Familiarity with CNI plugins including NVIDIA CNI and Multus
Experience with Infrastructure-as-Code tools such as Terraform
Knowledge of CI/CD pipelines and Git-based workflows

Preferred Experience

Experience with container runtimes such as containerd, CRI-O, and NVIDIA Container Toolkit
Exposure to Cilium or advanced CNI networking solutions
Contributions to open-source projects within Kubernetes or NVIDIA ecosystems
Experience working in HPC or large-scale AI infrastructure environments
