Inference Engineer
San Francisco, CA - USA
Job Description
Machine Learning Engineer, Inference
Want to solve realtime inference problems where milliseconds genuinely matter?
This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.
This is not another generic AI platform role focused on wrapping APIs or building dashboards.
The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.
You'll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it's generating speech naturally, reliably, and fast enough for real human interaction.
Your work will include:
- Building and optimising realtime TTS streaming infrastructure
- Improving scheduler and batching systems for production workloads
- Reducing TTFA/TTFB while maintaining speech quality and stability
- GPU profiling and identifying kernel-level bottlenecks
- Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
- Managing KV cache systems, speculative decoding, and streaming inference
- Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
- Collaborating closely with model researchers to productionise cutting-edge speech systems
A large part of the role involves solving difficult runtime problems where latency, consistency, concurrency, and throughput directly impact user experience. The team already operates beyond the performance of most publicly available realtime speech systems, but there's still substantial room to push the infrastructure further.
You'll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.
The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30 ms feels meaningful, you'll probably enjoy this team.
The stack includes Rust, C, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.
Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.
Location: Remote across the US or Europe.
If you're excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring.
All applicants will receive a response.
Required Experience:
IC