Senior AI Inference Engineer, Model Optimization & Deployment

Zoox

Job Location:

San Diego, CA - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence.

As a Model Optimization & Deployment Engineer, you will focus on bringing highly efficient, production-ready, large-scale models to our on-vehicle stack. We are looking for experts with hands-on experience in compressing, accelerating, and deploying complex models (LLMs, VLMs, or FMs) for power- and thermal-constrained vehicle SoCs. You will optimize ML models, write custom CUDA kernels, and build highly concurrent inference code to ensure real-time, deterministic execution on edge devices.

In this role, you will:

Optimize large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs) using advanced quantization (PTQ, QAT), pruning, mixed-precision inference frameworks, and parameter-efficient fine-tuning (LoRA, QLoRA).
Architect and implement model conversion and compilation pipelines using TensorRT for edge deployment.
Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch frameworks and compiled edge binaries.
Develop and optimize custom ML OPs and TensorRT Plugins with efficient CUDA kernels to minimize latency and maximize memory bandwidth on AI accelerators.
Write production-level, low-latency, memory-safe C++ and CUDA code for real-time inference on vehicle systems.

Qualifications:

Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference frameworks (INT8, FP8, FP4, BF16/FP16).
Proven experience optimizing large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs/VLAs) utilizing efficient attention mechanisms (e.g., FlashAttention, Linear Attention), KV-cache optimization (e.g., PagedAttention), and speculative decoding.
Extensive experience with model conversion/compilation pipelines (e.g., ONNX, TensorRT) and performing rigorous latency benchmarking and model quality parity evaluation.
Proficiency in low-level programming for AI accelerators, specifically developing and optimizing custom ML OPs and TensorRT Plugins with efficient CUDA kernel implementations.
Production-level C++ (14/17/20) and Python programming skills, with experience developing concurrent, memory-safe, real-time inference code for edge devices.

Bonus Qualifications:

Familiarity with SOTA autonomous driving perception algorithms (temporal 3D object detection, BEV, 3D Occupancy Networks) and multi-modal sensor processing (Vision, LiDAR, Radar).
Experience with distributed training pipelines and model/tensor parallelism (PyTorch Distributed, Ray, DeepSpeed, Megatron-LM) and runtime efficiency optimization for GPU clusters.
Experience with end-to-end autonomous driving paradigms (VLM/VLA models, Foundation models) and edge deployment technologies (e.g., TensorRT-LLM).

$242,000 - $290,000 a year

Base Salary Range

There are three major components to compensation for this position: salary, Amazon Restricted Stock Units (RSUs), and Zoox Stock Appreciation Rights. A sign-on bonus may be offered as part of the compensation package. The listed range applies only to the base salary. Compensation will vary based on geographic location and level. Leveling, as well as positioning within a level, is determined by a range of factors, including but not limited to a candidate's relevant years of experience, domain knowledge, and interview performance. The salary range listed in this posting is representative of the range of levels Zoox is considering for this position.

Zoox also offers a comprehensive package of benefits, including paid time off (e.g., sick leave, vacation, bereavement), unpaid time off, Zoox Stock Appreciation Rights, Amazon RSUs, health insurance, long-term care insurance, long-term and short-term disability insurance, and life insurance.

About Zoox

Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. Sitting at the intersection of robotics, machine learning, and design, Zoox aims to provide the next generation of mobility-as-a-service in urban environments. We're looking for top talent that shares our passion and wants to be part of a fast-moving and highly execution-oriented team.

Accommodations

If you need an accommodation to participate in the application or interview process, please reach out to [email protected] or your assigned recruiter.

A Final Note:

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Required Experience:

Senior IC

About Company

Zoox

We’re reinventing personal transportation—making the future safer, cleaner, and more enjoyable for everyone. This is on-demand autonomous ride-hailing.
