Senior Kubernetes Engineer
Location: Dallas, TX
Overview
This organization is backed by dedicated leadership and investment, with a clear mission, as it operates at the bleeding edge of technology. Its goal is to scale and enhance high-performance computing (HPC) and cloud infrastructure that supports clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.
We are seeking a highly skilled Senior Kubernetes Engineer to join our office. In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments. You will have deep expertise with both NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins, and custom operators.
Key Responsibilities
- Architect and operate Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator, and DCGM.
- Develop, deploy, and maintain custom Kubernetes operators and controllers to automate infrastructure services.
- Integrate NVIDIA device plugins, Multi-Instance GPU (MIG), and GPU sharing features into the scheduling layer.
- Optimise GPU utilisation and job placement through scheduler extensions such as kube-scheduler plugins, Slurm, and Volcano.
- Collaborate with HPC, ML, and DevOps teams to ensure multi-tenant, high-throughput cluster performance.
- Drive observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry.
- Implement secure multi-user and multi-namespace GPU isolation with RBAC and policy enforcement such as OPA or Gatekeeper.
- Maintain CI/CD pipelines for Kubernetes infrastructure using GitOps with ArgoCD and FluxCD.
- Contribute to infrastructure-as-code using Terraform, Helm, and Kustomize.
- Participate in performance tuning, incident response, and production readiness reviews.
Required Experience
- Extensive experience with Kubernetes in production-grade environments and with NVIDIA tooling on Kubernetes, including GPU Operator, device plugin, NVML, MIG, and DCGM.
- Proficiency in Go or Python for operator development and Kubernetes controller logic.
- Deep understanding of Kubernetes internals, including CRDs, RBAC, custom controllers, and scheduler extensions.
- Experience with GPU-intensive workloads, for example LLM training pipelines and scientific computing.
- Hands-on experience with Helm, Kustomize, and GitOps workflows.
- Familiarity with CNI plugins, especially NVIDIA CNI and Multus.
- Experience with monitoring GPU metrics and cluster health using Prometheus and DCGM Exporter.