On-prem Platform Engineer

TalentOla

Job Location:

Charlotte, NC - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1

Job Summary

Role: On-prem Platform Engineer

Location: Charlotte, NC

Key Skills:

Must-Have Skills (Mandatory Keywords)

LLM Inference & Optimization

vLLM, TensorRT-LLM, Triton Inference Server, SGLang
Inference optimization techniques:
- Continuous batching
- Speculative decoding
- KV cache / Prefix caching
Model optimization:
- FP8, AWQ, GPTQ

Distributed & GPU Systems

Tensor parallelism and large model scaling
CUDA, NCCL, GPU architecture
GPU partitioning & optimization (MIG)

Kubernetes & ML Serving

Kubernetes-based ML serving platforms
KServe, OpenShift AI
Helm charts, Operators, platform automation

GPU Orchestration

Run:AI or similar GPU scheduling/orchestration platforms
Multi-tenant GPU workload management

Platform Engineering

Experience building internal AI/ML platforms (on-prem or hybrid)
Strong automation and system design mindset

Observability & Performance

Prometheus, Grafana
ML observability (model latency, throughput, drift, resource utilization)
Performance benchmarking and tuning

Good to Have / Preferred Skills

Experience with LLMOps / GenAI pipelines
Exposure to hybrid cloud (on-prem integrated with GCP/Azure)
Familiarity with Inferentia / alternative accelerators
Knowledge of service mesh / networking in GPU clusters

Responsibilities

Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.

Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).

Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.

Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.

Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.

Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services.

Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
