Engineer, Inference & Model Serving

Techire Ai

Job Location:

San Francisco, CA - USA

Salary: $220,000 - $320,000 per year
Posted on: 12 hours ago
Vacancies: 1

Job Description

ML Model Serving Engineer

Want to build the layer that actually makes AI usable in real time?

You'll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments.

They're building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack.

You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits: extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load.

This is not about training models. It's about making them fast, efficient, and production-ready.

What you'll work on:

  • Building high-performance serving systems for LLM, speech, and vision models
  • Scaling inference to production workloads with strict latency requirements
  • Optimising GPU utilisation and execution efficiency
  • Implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation (a toy sketch follows this list)
  • Improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang
  • Profiling and debugging performance across GPU, memory, and system layers
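
To make the continuous batching bullet concrete, here is a minimal, framework-free Python sketch of the idea: new requests are admitted into the decode loop the moment a slot frees up, rather than waiting for the whole batch to drain as static batching does. The Request class, slot count, and token strings are illustrative placeholders, not any particular framework's API.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        # Toy generation request: decodes one token per step until done.
        rid: int
        remaining: int
        tokens: list = field(default_factory=list)

    def continuous_batching(requests, max_slots=3):
        # Admit waiting requests as soon as a slot frees, instead of
        # waiting for the whole batch to finish (static batching).
        waiting, running, step = deque(requests), [], 0
        while waiting or running:
            while waiting and len(running) < max_slots:
                running.append(waiting.popleft())
            for req in running:               # one decode step for the batch
                req.tokens.append(f"tok{step}")
                req.remaining -= 1
            for req in [r for r in running if r.remaining == 0]:
                print(f"step {step}: request {req.rid} finished")
            running = [r for r in running if r.remaining > 0]
            step += 1

    continuous_batching([Request(0, 2), Request(1, 5), Request(2, 3), Request(3, 1)])

Short requests retire early and their slots are immediately reused, which is what keeps GPU utilisation high under mixed-length traffic.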

What you'll bring:

  • Strong experience with ML inference or model serving systems
  • Deep understanding of latency and throughput optimisation in production
  • Solid Python and PyTorch skills, plus a systems or performance engineering mindset
  • Familiarity with distributed systems and production infrastructure

Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale.
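
As a taste of the profiling side of the role, here is a minimal sketch using PyTorch's built-in profiler; the toy Linear model is just a stand-in for a real serving workload, and the iteration count is arbitrary.

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096)        # stand-in for a real model
    x = torch.randn(32, 4096)

    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():              # also capture GPU kernels if present
        activities.append(ProfilerActivity.CUDA)
        model, x = model.cuda(), x.cuda()

    with profile(activities=activities, record_shapes=True) as prof:
        with torch.no_grad():
            for _ in range(10):
                model(x)

    # Per-op time breakdown: the usual starting point for finding bottlenecks.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))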

You'll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research.

There's real ownership here. You'll help define how next-generation AI systems are served.

Package:
$220,000 - $320,000 base + equity
San Francisco, onsite 3 days per week

If you're interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring.

All applicants will receive a response.


Required Experience:

IC
