AI Infrastructure Platform Engineer

VDart Inc


Job Location:

Charlotte, NC - USA

Monthly Salary: Not Disclosed
Posted on: 9 hours ago
Vacancies: 1 Vacancy

Job Summary

Role: AI Infrastructure Platform Engineer

Location: Charlotte, NC (Hybrid)

Type: Contract

Description:

  • Lead complex infrastructure initiatives supporting Generative AI and Predictive AI platforms from design to production operations.
  • Serve as a technical lead for platforms supporting AI/ML model training, inference, and batch workloads.
  • Design, build, deploy, and operate OpenShift-based container platforms optimized for high-performance GPU workloads.
  • Build, support, and operate scalable GPU SuperPod architecture with large multi-node GPU clusters.
  • Own monitoring, alerting, and observability using Grafana, Splunk, and enterprise telemetry tools.
  • Define SLIs/SLOs and build actionable alerts to proactively detect performance, capacity, and resiliency risks.
  • Build AI- and agent-based automation tools for self-healing, scaling, diagnostics, and incident remediation.
  • Apply AIOps techniques to reduce alert fatigue and improve platform reliability.
  • Lead production incident analysis and ensure operational rigor and root-cause prevention.
  • Mentor engineers and influence stakeholders across a geographically distributed organization.

Required Qualifications:

  • 5 years of infrastructure engineering experience.
  • 5 years troubleshooting complex end-to-end architectures (including CI/CD pipelines).
  • 5 years Linux systems experience.
  • 4 years supporting AI/ML platforms.
  • 4 years of Kubernetes / container platform experience including production support.

Desired Qualifications:

  • Experience with Generative AI and Predictive AI platforms.
  • Hands-on GPU platform operations, including scheduling, quota, and performance tuning.
  • Experience with OpenShift in GPU-enabled multi-tenant environments.
  • Experience designing or operating GPU SuperPods.
  • Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
  • Experience building AI- or agent-driven automation tooling (AIOps).
  • Hands-on experience supporting AI/ML workloads on GCP and Azure, including GPU-backed services and managed AI infrastructure.
  • Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
  • Strong experience monitoring AI workload signals such as inference latency and GPU utilization.
  • Experience with BCP/DR, resiliency, and highly available architectures.

Job Expectations:

  • Participation in a 24x7 on-call rotation.
  • Ownership of production stability, platform health, and customer outcomes.
  • Operate in regulated enterprise environments with strong risk and control focus.