Help us make inference blazingly fast. If you love squeezing every last drop of performance out of GPUs, diving deep into CUDA kernels, and turning optimization techniques into production systems, we'd love to meet you.
About
trains and hosts specialized language models for companies that need frontier-quality AI at a fraction of the cost. The models we train match GPT-5 accuracy but are smaller, faster, and up to 90% cheaper. Our platform handles everything end-to-end: distillation, training, evaluation, and planet-scale hosting.
We are a well-funded, ten-person team of engineers who work in person in downtown San Francisco on difficult, high-impact engineering problems. Everyone on the team has been writing code for over 10 years and has founded and run their own software companies. We are high-agency, adaptable, and collaborative. We value creativity alongside technical prowess and humility. We work hard and deeply enjoy the work that we do. Most of us are in the office 4 days a week in SF; hybrid works for Bay Area candidates.
About the Role
You will be responsible for making our inference stack as fast and efficient as possible. Your work ranges from implementing known optimization techniques to experimenting with novel approaches, always with the goal of serving models faster and cheaper at scale.
Your north star is inference performance: latency, throughput, cost efficiency, and how quickly we can bring new model architectures into production. You'll work across the full inference stack, from CUDA kernels to serving frameworks, to find and eliminate bottlenecks. This role reports directly to the founding team. You'll have the autonomy, a large compute budget, and the technical support to push the limits of what's possible in model serving.
Key Responsibilities
Implement and productionize optimization techniques including quantization, speculative decoding, KV cache optimization, continuous batching, and LoRA serving
Dive deep into inference frameworks (vLLM, SGLang, TensorRT-LLM) and their underlying libraries to debug and improve performance
Profile and optimize CUDA kernels and GPU utilization across our serving infrastructure
Add support for new model architectures, ensuring they meet our performance standards before going to production
Experiment with novel inference techniques and bring successful approaches into production
Build tooling and benchmarks to measure and track inference performance across our fleet
Collaborate with applied ML engineers to ensure trained models can be served efficiently
Requirements
2 years of experience in ML systems, inference optimization, or GPU programming
Strong proficiency in Python and familiarity with C++
Hands-on experience with LLM inference frameworks (vLLM, SGLang, TensorRT-LLM, or similar)
Deep understanding of GPU architecture and experience profiling GPU workloads
Familiarity with LLM optimization techniques (quantization, speculative decoding, continuous batching, KV cache management)
Experience with PyTorch and understanding of how models execute on hardware
Track record of measurably improving system performance
Nice-to-Have
Experience with CUDA programming
Familiarity with serving non-LLM models (TTS, vision, embeddings)
Experience with distributed inference and multi-GPU serving
Contributions to open-source inference frameworks
Experience with Docker and Kubernetes
You don't need to tick every box. Curiosity and the ability to learn quickly matter more.
Compensation
We offer competitive compensation, equity in a high-growth startup, and comprehensive benefits. The base salary range for this role is $220,000 - $320,000, plus equity and benefits, depending on experience.
Equal Opportunity
is an equal opportunity employer. We welcome applicants from all backgrounds and don't discriminate based on race, color, religion, gender, sexual orientation, national origin, genetics, disability, age, or veteran status.
If you're excited about making AI inference faster for everyone, we'd love to hear from you. Please send your resume and GitHub to and/or apply here on Ashby.
Required Experience:
Senior IC