AIML Staff ML Infrastructure Engineer, ML Platform & Technology Pre-training Compute

Apple

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed
Posted on: 29 days ago
Vacancies: 1

Job Summary

As an engineer on the ML Compute team, your work will include:

  • Drive large-scale pre-training initiatives to support cutting-edge foundation models, focusing on resiliency, efficiency, scalability, and resource optimization.
  • Enhance distributed training techniques for foundation models.
  • Research and implement new patterns and technologies to improve system performance, maintainability, and design.
  • Optimize execution and performance of workloads built with JAX, PyTorch, XLA, and CUDA on large distributed systems.
  • Leverage high-performance networking technologies such as NCCL for GPU collectives and TPU interconnect (ICI/Fabric) for large-scale distributed training.
  • Architect a robust MLOps platform to streamline and automate pre-training operations.
  • Operationalize large-scale ML workloads on Kubernetes, ensuring distributed training is robust, efficient, and fault-tolerant.
  • Lead complex technical projects, defining requirements and tracking progress with team members.
  • Collaborate with cross-functional engineers to solve large-scale ML training challenges.
  • Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
  • Cultivate a team centered on collaboration, technical excellence, and innovation.


Minimum Qualifications

  • Bachelor's degree in Computer Science, engineering, or a related field
  • 6 years of hands-on experience building scalable backend systems for training and evaluation of machine learning models
  • Proficient in relevant programming languages such as Python or Go
  • Strong expertise in distributed systems, reliability and scalability, containerization, and cloud platforms
  • Proficient in cloud computing infrastructure and tools: Kubernetes, Ray, PySpark
  • Ability to clearly and concisely communicate technical and architectural problems while working with partners to iteratively find


Preferred Qualifications

  • Advanced degree in Computer Science, engineering, or a related field
  • Proficient in working with and debugging accelerators such as GPU, TPU, and AWS Trainium
  • Proficient in ML training and deployment frameworks such as JAX, TensorFlow, PyTorch, TensorRT, and vLLM

Required Experience:

Staff IC

Key Skills

  • Jenkins
  • Ruby
  • Python
  • Active Directory
  • Cloud
  • PowerShell
  • Windows
  • AWS
  • Linux
  • SAN
  • Java
  • Troubleshoot
  • Backup
  • Puppet
  • Hardware

About Company

Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...
