On-prem Platform Engineer

Realtech Services


Job Location:

Charlotte, NC - USA

Monthly Salary: Not Disclosed
Posted on: 1 hour ago
Vacancies: 1 Vacancy

Job Summary

Position: On-prem Platform Engineer

Location: Charlotte, NC (Onsite)

Duration: Long Term Contract

No. of positions: 3

Key Skills:

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization

  • vLLM, TensorRT-LLM, Triton Inference Server, SGLang
  • Inference optimization techniques:
    • Continuous batching
    • Speculative decoding
    • KV cache / Prefix caching
  • Model optimization:
    • FP8, AWQ, GPTQ

Distributed & GPU Systems

  • Tensor parallelism and large model scaling
  • CUDA, NCCL, GPU architecture
  • GPU partitioning & optimization (MIG)

Kubernetes & ML Serving

  • Kubernetes-based ML serving platforms
  • KServe, OpenShift AI
  • Helm charts, Operators, platform automation

GPU Orchestration

  • Run:AI or similar GPU scheduling/orchestration platforms
  • Multi-tenant GPU workload management

Platform Engineering

  • Experience building internal AI/ML platforms (on-prem or hybrid)
  • Strong automation and system design mindset

Observability & Performance

  • Prometheus, Grafana
  • ML observability (model latency, throughput, drift, resource utilization)
  • Performance benchmarking and tuning

Good to Have / Preferred Skills:

  • Experience with LLMOps / Gen-AI pipelines
  • Exposure to hybrid cloud (on-prem with GCP/Azure integration)
  • Familiarity with Inferentia / alternative accelerators
  • Knowledge of service mesh / networking in GPU clusters

Key Responsibilities:

  • Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving Gen-AI models and LLM inference workloads.
  • Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
  • Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
  • Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
  • Drive inference optimization and benchmarking using FP8, AWQ, and GPTQ, along with performance tools such as GuideLLM and Locust.
  • Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI to ensure SLA/SLO compliance for Gen-AI services.
  • Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize Gen-AI use cases.