Inference Engineer

Techire Ai

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Description

Machine Learning Engineer, Inference

Want to solve realtime inference problems where milliseconds genuinely matter?

This role is with a fast-growing voice AI company building the realtime speech infrastructure layer behind hundreds of millions of production conversations every month. Their systems power enterprise voice experiences used at massive scale across customer support, ordering, and conversational automation.

This is not another generic AI platform role focused on wrapping APIs or building dashboards.

The work here sits deep in the runtime stack, optimising realtime speech systems under production latency constraints. Think streaming inference, scheduler design, GPU utilisation, concurrency optimisation, dynamic batching, and making state-of-the-art speech models actually behave correctly in realtime environments.

You'll join a lean engineering team working directly on the inference systems behind low-latency conversational speech models. The challenge is not simply generating outputs; it's generating speech naturally, reliably, and fast enough for real human interaction.

Your work will include:

Building and optimising realtime TTS streaming infrastructure
Improving scheduler and batching systems for production workloads
Reducing TTFA/TTFB while maintaining speech quality and stability
GPU profiling and identifying kernel-level bottlenecks
Optimising TensorRT, Triton, ONNX Runtime, and custom serving systems
Managing KV cache systems, speculative decoding, and streaming inference
Supporting heterogeneous deployment environments across NVIDIA and AMD GPUs
Collaborating closely with model researchers to productionise cutting-edge speech systems

A large part of the role involves solving difficult runtime problems where latency, consistency, concurrency, and throughput directly impact user experience. The team already operates beyond the performance of most publicly available realtime speech systems, but there's still substantial room to push the infrastructure further.

You'll likely have strong depth across inference systems, runtime optimisation, distributed serving, or GPU performance engineering. Experience with tools like TensorRT, Triton, vLLM, CUDA Graphs, ONNX Runtime, or custom schedulers would be highly valuable.

The environment suits engineers who naturally investigate bottlenecks, enjoy working close to hardware constraints, and care deeply about performance engineering. If reducing latency by 30 ms feels meaningful, you'll probably enjoy this team.

The stack includes Rust, C, Python, CUDA, TensorRT, Triton, Kubernetes, AWS, and custom realtime inference infrastructure.

Compensation is highly competitive and flexible depending on experience, including strong salary, equity, and benefits.

Location: Remote across the US or Europe.

If you're excited by realtime AI systems problems where optimisation work directly shapes production performance at scale, this would be worth exploring.

All applicants will receive a response.
