ML Platform Engineer

Avride

Job Location:

Austin, TX - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Job Summary

About the team

The ML Platform team at Avride builds the infrastructure that powers large-scale ML training and data processing for autonomous driving. We sit between Cloud Platform and ML engineers, turning low-level compute, storage, and networking primitives into an ML platform that teams actually use: scalable orchestration, distributed compute, and production-grade tooling for the full model lifecycle.

About the role

As an ML Platform Engineer at Avride, you'll own critical pieces of the ML stack: workflow orchestration, distributed execution, and resource governance. Your work will shape how ML teams across the company run experiments and train models at scale. You will build the abstractions and services that make training workloads reliable, cost-efficient, and fast, helping ML teams run at scale on Kubernetes with strong reliability and an excellent developer experience.

What you will do

Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution and multi-tenant resource governance: scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
Improve end-to-end training throughput and platform efficiency by optimizing data access patterns and caching, and by removing bottlenecks in storage, network, and resource contention
Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What you will need

Strong proficiency in Python or Go; C is a plus
Track record of designing and building scalable, maintainable systems and services
Experience operating production services end-to-end: APIs, reliability practices, observability
Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
Ability to troubleshoot complex production issues across logs, metrics, and traces, and drive them to resolution

Nice to have

Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
Track record of optimizing resource usage and performance in distributed environments

Candidates are required to be authorized to work in the U.S. The employer is not offering relocation sponsorship, and remote work options are not available.

Avride is an equal opportunity employer and is committed to providing reasonable accommodations to qualified applicants and employees with disabilities to ensure they have equal access to employment opportunities. Avride complies with the Americans with Disabilities Act (ADA). If you need a reasonable accommodation to assist with the application or hiring process, or to perform the essential functions of a job, please email .
