HPC Kubernetes Solutions Architect (GPU Platforms)

Not Interested
Bookmark
Report This Job

profile Job Location:

Marshall County, WV - USA

profile Monthly Salary: Not Disclosed
Posted on: 13 hours ago
Vacancies: 1 Vacancy

Job Summary

HPC Kubernetes Solutions Architect (GPU Platforms)
Location: Dallas TX (Hybrid)
Type: Direct Hire

Competitive base salary performance bonus
100% company-paid benefits

Overview

We are seeking an HPC Kubernetes Solutions Architect to lead the design integration and adoption of GPU-accelerated Kubernetes platforms supporting HPC AI/ML simulation and scientific workloads.

This is a highly technical customer-facing architecture role with ownership across the full solution lifecyclefrom discovery and requirements gathering through architecture design proof-of-concept delivery deployment and long-term optimization. The role serves as a trusted advisor to customers while also influencing internal product and engineering direction through real-world feedback.

The ideal candidate brings deep expertise across Kubernetes GPU orchestration and HPC environments along with the ability to design scalable high-performance platforms and guide customers through complex infrastructure transformations.

Key Responsibilities

Customer Engagement & Architecture Leadership

Serve as the primary architectural point of contact for customers adopting GPU-accelerated Kubernetes platforms
Capture workload requirements performance objectives and scaling needs translating them into reference architectures and solution designs
Lead customer workshops technical design sessions and architecture reviews

Kubernetes & GPU Platform Engineering

Architect and operate Kubernetes clusters optimized for GPU workloads using NVIDIA GPU Operator Network Operator DCGM and device plugins
Integrate Multi-Instance GPU (MIG) GPU sharing and advanced scheduling (Volcano Slurm integration kube-scheduler plugins)
Design and implement multi-tenant Kubernetes environments with strong isolation and performance guarantees

Automation & Operator Development

Develop or extend custom Kubernetes operators and controllers using Go or Python
Automate HPC infrastructure services and platform operations
Support Infrastructure-as-Code and GitOps practices using Terraform Helm Kustomize ArgoCD and FluxCD

Performance Optimization & Benchmarking

Lead proof-of-concept and benchmarking initiatives to validate performance and scalability
Utilize profiling tools and workload characterization methodologies to optimize GPU utilization and cluster performance
Conduct performance tuning across compute storage and networking layers

Integration & Infrastructure Design

Define integration strategies across compute storage networking and orchestration layers
Support CNI integrations (NVIDIA CNI Multus Cilium) distributed storage (Lustre GPFS Ceph VAST) and container runtimes
Ensure seamless integration with HPC schedulers and enterprise systems

Observability & Monitoring

Implement monitoring and telemetry solutions using Prometheus Grafana DCGM Exporter and OpenTelemetry
Provide visibility into GPU health cluster utilization and workload performance

Cross-Functional Collaboration

Partner with HPC ML DevOps and platform teams to ensure scalability and performance in hybrid and on-prem environments
Collaborate with product and engineering teams to influence roadmap and platform improvements
Build relationships with ecosystem vendors including NVIDIA networking providers and storage partners

Innovation & Thought Leadership

Stay current on GPU roadmaps interconnect technologies (InfiniBand RoCE NVLink) and Kubernetes advancements
Provide forward-looking guidance to customers on scaling and future architecture evolution
Represent the organization in technical workshops design sessions and industry events

Required Experience

Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments
Deep expertise across:

  • NVIDIA GPU ecosystem (GPU Operator device plugins MIG NVML DCGM)
  • Kubernetes internals (CRDs RBAC scheduler extensions custom operators/controllers)
  • High-performance networking (InfiniBand RDMA RoCE)
  • Distributed storage integration for HPC workloads

Proven ability to design scalable secure and resilient Kubernetes-based architectures
Proficiency in Go or Python for operator development and automation
Experience with workload profiling benchmarking and performance tuning
Strong customer-facing skills with the ability to translate requirements into actionable architectures
Experience collaborating across engineering product and operations teams

Preferred Experience

Experience delivering end-to-end HPC or AI/ML solutions from design through deployment and optimization
Familiarity with containerized HPC environments (e.g. Singularity/Apptainer)
Experience with GitOps practices and CI/CD pipelines for Kubernetes platforms
Contributions to open-source projects in Kubernetes or NVIDIA ecosystems
Experience advising customers on future-state architectures and emerging technologies
Bachelors or Masters degree in Computer Science Engineering Physics or related field
Relevant certifications such as CKA CKAD CKS AWS Solutions Architect or Azure Solutions Architect Expert

HPC Kubernetes Solutions Architect (GPU Platforms) Location: Dallas TX (Hybrid) Type: Direct Hire Competitive base salary performance bonus 100% company-paid benefitsOverviewWe are seeking an HPC Kubernetes Solutions Architect to lead the design integration and adoption of GPU-accelerated Kubernet...
View more view more