We are seeking an experienced AI/ML Ops Engineer with strong expertise in NVIDIA GPU environments, AI/ML infrastructure, and sanity testing/validation processes. The ideal candidate will support deployment, monitoring, and operational validation of AI/ML workloads while ensuring system performance, stability, and reliability across GPU-based platforms.
Key Responsibilities
Perform sanity checks and validation for AI/ML models, pipelines, and GPU environments.
Manage and optimize NVIDIA GPU-based AI/ML infrastructure.
Monitor AI/ML workloads, troubleshoot issues, and ensure high system availability.
Work with MLOps tools for deployment, automation, and CI/CD processes.
Collaborate with AI engineers, DevOps, and infrastructure teams for production support.
Analyze logs, performance metrics, and system behavior to identify bottlenecks.
Required Skills
Strong experience in AI/ML Ops or MLOps environments.
Experience with sanity testing.
Hands-on experience with NVIDIA GPUs, CUDA, TensorRT, or related technologies.
Knowledge of Kubernetes, Docker, Linux, and cloud platforms (AWS/Azure/GCP).
Experience with Python scripting and automation tools.
Familiarity with monitoring, testing, and sanity validation processes for AI systems.
Experience with ML model deployment and performance tuning.
Understanding of CI/CD pipelines and infrastructure automation.
Primary Skills
GCP
Azure
Terraform
Kubernetes
Python
GenAI Platforms
Arize AI
Claude Cowork
HashiCorp Vault
LLMs
RAG
Job Title: AI/ML Ops Engineer Sanity Check (NVIDIA)
Location: Charlotte, NC (Onsite)