On-prem Platform Engineer
Location: Charlotte, NC
Type of Hire: C2C
Positions: 3
Key Skills:
Must-Have Skills (Mandatory Keywords)
LLM Inference & Optimization
- vLLM, TensorRT-LLM, Triton Inference Server, SGLang
- Inference optimization techniques:
  - Continuous batching
  - Speculative decoding
  - KV cache / prefix caching
- Model optimization: quantization (FP8, AWQ, GPTQ)
Distributed & GPU Systems
- Tensor parallelism and large model scaling
- CUDA, NCCL, GPU architecture
- GPU partitioning & optimization (MIG)
Kubernetes & ML Serving
- Kubernetes-based ML serving platforms
- KServe, OpenShift AI
- Helm charts, Operators, platform automation
GPU Orchestration
- Run:AI or similar GPU scheduling/orchestration platforms
- Multi-tenant GPU workload management
Platform Engineering
- Experience building internal AI/ML platforms (on-prem or hybrid)
- Strong automation and system design mindset
Observability & Performance
- Prometheus, Grafana
- ML observability (model latency, throughput, drift, resource utilization)
- Performance benchmarking and tuning
Good to Have / Preferred Skills
- Experience with LLMOps / GenAI pipelines
- Exposure to hybrid cloud (on-prem plus GCP/Azure integration)
- Familiarity with AWS Inferentia or alternative accelerators
- Knowledge of service mesh / networking in GPU clusters
Responsibilities:
Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, applying advanced techniques such as continuous batching, speculative decoding, and KV caching (see the vLLM sketch after this list).
Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput (see the MIG inventory sketch below).
Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving (see the KServe sketch below).
Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust (see the Locust sketch below).
Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services (see the metrics sketch below).
Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
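
The sketches below illustrate several of the duties above; each is a minimal, hedged example, not a prescribed implementation. First, a vLLM serving sketch: continuous batching is vLLM's default scheduler behavior, and prefix caching is enabled via a constructor flag. The model name and flag values are illustrative, and speculative-decoding options vary across vLLM releases, so they are noted only in a comment.

```python
# Minimal vLLM sketch, assuming a 2-GPU node and an illustrative model id.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,        # shard weights across 2 GPUs
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
    gpu_memory_utilization=0.90,   # leave headroom for KV cache growth
)
# Speculative decoding is also configured on the constructor; the exact
# kwargs (draft model, speculative token count) differ across vLLM releases.

params = SamplingParams(temperature=0.2, max_tokens=256)
# Continuous batching needs no flag: concurrently submitted requests are
# batched and interleaved by vLLM's scheduler automatically.
outputs = llm.generate(["Summarize NCCL's role in multi-GPU inference."], params)
print(outputs[0].outputs[0].text)
```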
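Next, a GPU/MIG inventory sketch using nvidia-ml-py (pynvml), the kind of check run before partitioning cards or admitting multi-tenant workloads. It assumes NVML-capable NVIDIA drivers; MIG queries succeed only on Ampere-or-newer GPUs.

```python
# GPU/MIG inventory sketch via nvidia-ml-py (pynvml).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(h)
        name = name.decode() if isinstance(name, bytes) else name
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        try:
            current, _pending = pynvml.nvmlDeviceGetMigMode(h)
            mig = "enabled" if current == pynvml.NVML_DEVICE_MIG_ENABLE else "disabled"
        except pynvml.NVMLError:
            mig = "unsupported"  # pre-Ampere GPUs raise here
        print(f"GPU {i}: {name} | "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB | MIG {mig}")
finally:
    pynvml.nvmlShutdown()
```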
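A KServe deployment sketch using the official Kubernetes Python client to create a v1beta1 InferenceService. The namespace, name, model format, and storageUri are placeholders, and the spec fields should be checked against the KServe version installed on the cluster.

```python
# Sketch: create a KServe InferenceService with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-demo", "namespace": "genai"},  # placeholders
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},
                "storageUri": "pvc://models/llama-3-8b",  # hypothetical path
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="genai",
    plural="inferenceservices",
    body=isvc,
)
```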
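A Locust benchmarking sketch against an OpenAI-compatible completions route. The URL path, model id, and payload are assumptions; vLLM's OpenAI-compatible server and Triton expose different routes, so adapt accordingly.

```python
# Locust load-test sketch for an LLM inference endpoint.
from locust import HttpUser, task, between


class InferenceUser(HttpUser):
    wait_time = between(0.5, 2.0)  # per-user think time between requests

    @task
    def completion(self):
        self.client.post(
            "/v1/completions",            # assumed OpenAI-compatible route
            json={
                "model": "llama-3-8b",    # hypothetical served model id
                "prompt": "Explain KV caching in one sentence.",
                "max_tokens": 64,
            },
            name="completion",  # group all requests under one stats entry
        )
```

Run it with `locust -f locustfile.py --host http://<gateway>` and ramp concurrent users from the web UI to chart latency against throughput.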
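Finally, an application-level metrics sketch with prometheus_client. Metric names, labels, and buckets are illustrative; in practice vLLM and Triton already expose their own /metrics endpoints that Prometheus can scrape directly, with Grafana dashboards on top of either source.

```python
# Sketch: expose inference latency/throughput metrics for Prometheus.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Inference requests", ["model"])
LATENCY = Histogram(
    "llm_request_latency_seconds",
    "End-to-end request latency",
    ["model"],
    buckets=(0.1, 0.25, 0.5, 1, 2, 5, 10),  # tune to the service's SLOs
)


def handle_request(model: str) -> None:
    REQUESTS.labels(model=model).inc()
    with LATENCY.labels(model=model).time():   # records duration on exit
        time.sleep(random.uniform(0.05, 0.3))  # stand-in for real inference


if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes http://host:9100/metrics
    while True:
        handle_request("llama-3-8b")
```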