Position: On-prem Platform Engineer
Location: Charlotte, NC (Onsite)
Duration: Long Term Contract
No. of positions: 3
Key Skills:
Must-Have Skills (Mandatory Keywords)
LLM Inference & Optimization
- vLLM, TensorRT-LLM, Triton Inference Server, SGLang (a configuration sketch follows this skills list)
- Inference optimization techniques:
- Continuous batching
- Speculative decoding
- KV cache / Prefix caching
- Model optimization: quantization (FP8, AWQ, GPTQ)
Distributed & GPU Systems
- Tensor parallelism and large model scaling
- CUDA, NCCL, GPU architecture
- GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
- Kubernetes-based ML serving platforms
- KServe, OpenShift AI
- Helm charts, Operators, platform automation
GPU Orchestration
- Run:AI or similar GPU scheduling/orchestration platforms
- Multi-tenant GPU workload management
Platform Engineering
- Experience building internal AI/ML platforms (on-prem or hybrid)
- Strong automation and system design mindset
Observability & Performance
- Prometheus, Grafana
- ML observability (model latency, throughput, drift, resource utilization)
- Performance benchmarking and tuning
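
For illustration, a minimal vLLM sketch touching several of the must-have keywords above. Continuous batching is vLLM's default scheduling behavior; prefix caching and tensor parallelism are opt-in settings. The model name, parallel degree, and sampling values are placeholders, not project specifics.

    # Minimal vLLM sketch (illustrative; model and settings are placeholders).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        tensor_parallel_size=2,        # shard weights across 2 GPUs
        enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
        gpu_memory_utilization=0.90,   # leave headroom for KV cache growth
    )
    params = SamplingParams(temperature=0.2, max_tokens=256)

    # The engine schedules these requests with continuous batching rather
    # than waiting for a fixed-size batch to fill.
    for out in llm.generate(
        ["Explain KV caching in one paragraph.",
         "Summarize NCCL's role in multi-GPU inference."],
        params,
    ):
        print(out.outputs[0].text)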
Good to Have / Preferred Skills:
- Experience with LLMOps / Gen-AI pipelines
- Exposure to hybrid cloud (on-prem to GCP/Azure integration)
- Familiarity with AWS Inferentia / alternative accelerators
- Knowledge of service mesh / networking in GPU clusters
Responsibilities:
- Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving Gen-AI models and LLM inference workloads.
- Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
- Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
- Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving (deployment sketched after this list).
- Drive inference optimization and benchmarking leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust (load test sketched after this list).
- Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for Gen-AI services (metrics sketched after this list).
- Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize Gen-AI use cases.
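
For illustration, a minimal sketch of deploying a vLLM server as a custom-container KServe InferenceService via the official Kubernetes Python client. The namespace, image tag, model, and GPU count are placeholders, and it assumes a cluster with KServe installed and GPUs exposed through the NVIDIA device plugin.

    # Illustrative sketch: create a KServe InferenceService running a
    # vLLM OpenAI-compatible server as a custom predictor container.
    # Namespace, image, model, and GPU count are placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() inside the cluster

    isvc = {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "llama-vllm", "namespace": "llm-serving"},
        "spec": {
            "predictor": {
                "containers": [{
                    "name": "kserve-container",
                    "image": "vllm/vllm-openai:latest",  # placeholder tag
                    "args": ["--model", "meta-llama/Llama-3.1-8B-Instruct",
                             "--tensor-parallel-size", "2"],
                    "resources": {"limits": {"nvidia.com/gpu": "2"}},
                }]
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kserve.io", version="v1beta1",
        namespace="llm-serving", plural="inferenceservices", body=isvc,
    )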
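
Likewise, a minimal Locust sketch for benchmarking latency and throughput of an OpenAI-compatible completion endpoint; the path, payload, and model name are placeholders. Run it with something like: locust -f llm_loadtest.py --host http://<endpoint>

    # Illustrative Locust load test for an OpenAI-compatible LLM endpoint.
    from locust import HttpUser, task, between

    class LLMUser(HttpUser):
        wait_time = between(0.5, 2.0)  # think time between requests

        @task
        def completion(self):
            self.client.post(
                "/v1/completions",
                json={
                    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
                    "prompt": "Explain continuous batching in two sentences.",
                    "max_tokens": 128,
                },
                name="completions",  # aggregate stats under one label
            )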
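
Finally, a minimal sketch of exposing inference metrics with the prometheus_client library, so Prometheus can scrape them and Grafana can chart latency and throughput. Metric names, the port, and the run_inference stub are placeholders.

    # Illustrative sketch: expose request-count and latency metrics for scraping.
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("llm_requests_total", "Completed inference requests")
    LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

    def run_inference(prompt: str) -> str:
        time.sleep(0.05)  # stub standing in for the real inference engine
        return "ok"

    def serve_request(prompt: str) -> str:
        with LATENCY.time():  # records one latency observation
            result = run_inference(prompt)
        REQUESTS.inc()
        return result

    if __name__ == "__main__":
        start_http_server(9090)  # scrape target for Prometheus
        while True:
            serve_request("warm-up prompt")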