Machine Learning Engineer, Multi-Modality Foundation Model

Boston, MA - USA

Monthly Salary: Not Disclosed

Posted on: 3 days ago

Vacancies: 1 Vacancy

Job Summary

The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence. As a Multi-modality Foundation Model Engineer, you will focus on building highly efficient, production-ready multi-modality models. We are looking for experts who have hands-on experience building multi-modality foundation models, whether that involves AV-centric modalities (Vision, LiDAR, Radar) or broader domains (Vision, Language, Text, Audio). You will design, train, and deploy these models using Knowledge Distillation (KD) to transfer capabilities from large-scale proprietary teacher models to efficient student models capable of real-time, on-vehicle inference.

In this role you will:

Build, pre-train, and evaluate large-scale multi-modality foundation models from the ground up, successfully aligning diverse data streams (e.g., Vision, LiDAR, Radar, Language, Audio).
Define and execute the ML roadmap for deploying these multi-modality representations to the vehicle.
Architect and implement Knowledge Distillation pipelines to compress large-capacity multi-modal teacher models into highly efficient, production-ready student models.
Build high-quality training and evaluation datasets, applying advanced data-centric techniques to maximize cross-modal representation learning and student model convergence.
Collaborate with downstream perception teams to integrate and validate the performance, robustness, and latency of your models in on-board production systems.

Qualifications:

MS or PhD in Computer Science, Machine Learning, or a related technical field, with demonstrated professional experience.
Deep, proven expertise in building and training large-scale multi-modality foundation models (e.g., Vision-Language Models (VLMs), Vision-Audio-Text, or Vision-LiDAR-Radar architectures).
Strong understanding of cross-modal alignment, multi-modal attention mechanisms, and large-scale pre-training techniques.
Proven experience in Knowledge Distillation (KD), model compression, and training highly efficient student models for production environments.
Proficiency in ML frameworks (e.g., PyTorch) and experience building large-scale ML training and evaluation pipelines.

Bonus Qualifications:

Experience in the Autonomous Driving or robotics industry.
Experience with model deployment, optimization, and hardware constraints (e.g., C++ for inference, TensorRT, quantization, pruning).
Publications in top-tier conferences (CVPR, ICCV, NeurIPS, ICLR, ACL) related to multi-modality foundation models, cross-modal learning, or model compression.

$189,000 - $258,000 a year

Base Salary Range

There are three major components to compensation for this position: salary, Amazon Restricted Stock Units (RSUs), and Zoox Stock Appreciation Rights. A sign-on bonus may be offered as part of the compensation package. The listed range applies only to the base salary. Compensation will vary based on geographic location and level. Leveling, as well as positioning within a level, is determined by a range of factors, including, but not limited to, a candidate's relevant years of experience, domain knowledge, and interview performance. The salary range listed in this posting is representative of the range of levels Zoox is considering for this position.

Zoox also offers a comprehensive package of benefits, including paid time off (e.g., sick leave, vacation, bereavement), unpaid time off, Zoox Stock Appreciation Rights, Amazon RSUs, health insurance, long-term care insurance, long-term and short-term disability insurance, and life insurance.

About Zoox

Zoox is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem required to bring this technology to market. Sitting at the intersection of robotics, machine learning, and design, Zoox aims to provide the next generation of mobility-as-a-service in urban environments. We're looking for top talent that shares our passion and wants to be part of a fast-moving and highly execution-oriented team.

Accommodations

If you need an accommodation to participate in the application or interview process, please reach out to [email protected] or your assigned recruiter.

A Final Note:

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

Key Skills

Python
C/C++
Fortran
R
Data Mining
Matlab
Data Modeling
Laboratory Techniques
MongoDB
SAS
Systems Analysis
Dancing

About Company

Zoox

We’re reinventing personal transportation—making the future safer, cleaner, and more enjoyable for everyone. This is on-demand autonomous ride-hailing.
