On-prem Platform Engineer
Location: Brevard Charlotte
Tech stack: Arize AI, Claude Cowork, GCP, Terraform
Skills: vLLM, TensorRT-LLM, Triton Inference Server, SGLang, Inference Optimization, Continuous Batching, Speculative Decoding, KV Cache / Prefix Caching, FP8 / AWQ / GPTQ, Tensor Parallelism, Kubernetes ML Serving, KServe, OpenShift AI, Helm / Operators, GPU Orchestration, Run:AI, Performance Benchmarking, CUDA / NCCL / MIG, Prometheus / Grafana, ML Observability, GuideLLM, Locust

Responsibilities:
- Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
- Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, and SGLang, applying advanced techniques such as continuous batching, speculative decoding, and KV caching (see the serving sketch after this listing).
- Manage GPU orchestration and capacity with Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
- Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
- Drive inference optimization and benchmarking, leveraging FP8, AWQ, and GPTQ quantization and performance tools such as GuideLLM and Locust (see the load-test sketch below).
- Implement observability and ML monitoring with Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services (see the SLO query sketch below).
- Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.
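To make the inference-stack duties concrete, here is a minimal sketch of the kind of serving configuration the role describes, assuming vLLM as the engine. The model name, parallelism degree, and quantization choice are illustrative assumptions, not details from the listing.

```python
# Minimal vLLM serving sketch: tensor parallelism, quantization, prefix caching.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    tensor_parallel_size=2,        # shard weights across 2 GPUs
    quantization="awq",            # AWQ weight quantization (GPTQ/FP8 also supported)
    enable_prefix_caching=True,    # reuse KV cache across shared prompt prefixes
    gpu_memory_utilization=0.90,   # fraction of GPU memory for weights + KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
# vLLM's scheduler applies continuous batching by default: concurrent requests
# are batched at the token level rather than waiting for a full static batch.
outputs = llm.generate(["Summarize the benefits of paged KV caching."], params)
print(outputs[0].outputs[0].text)
```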
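For the benchmarking duties, a Locust load test against the serving endpoint might look like the sketch below. The route, host, and model name assume an OpenAI-compatible API (which vLLM and Triton can expose) and are illustrative, not taken from the listing.

```python
# Minimal Locust sketch for load-testing an OpenAI-compatible inference endpoint.
from locust import HttpUser, task, between

class InferenceUser(HttpUser):
    wait_time = between(0.5, 2.0)  # simulated think time between requests

    @task
    def chat_completion(self):
        # Hypothetical chat-completions route on the inference gateway.
        self.client.post(
            "/v1/chat/completions",
            json={
                "model": "llama-3.1-8b-instruct",
                "messages": [{"role": "user", "content": "Ping"}],
                "max_tokens": 64,
            },
            name="chat_completions",  # group results under one stats entry
        )
```

Run with, e.g., `locust -f loadtest.py --host http://inference.example.internal --users 50` (host and user count are placeholders).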
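For the SLA/SLO work, one common pattern is querying Prometheus' HTTP API for a latency quantile and comparing it to a target. This is a sketch under assumptions: the histogram metric name follows vLLM's exported `vllm:time_to_first_token_seconds`, and the 1-second p95 target is hypothetical; verify both against your deployment.

```python
# Minimal sketch: check a p95 time-to-first-token SLO via the Prometheus HTTP API.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical endpoint
QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
p95 = float(result[0]["value"][1]) if result else float("nan")
print(f"p95 TTFT: {p95:.3f}s", "OK" if p95 < 1.0 else "SLO breach")
```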