AIML Staff ML Infrastructure Engineer, ML Platform & Technology ML Compute

Apple


Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

As a staff engineer on the ML Compute team, your work will include:

  • Lead the development of the infrastructure to run large-scale workloads on the cloud, such as Apache Spark, Ray, and distributed training.
  • Optimize platform efficiency and throughput by improving resource management capabilities with schedulers like Apache YuniKorn and Kueue.
  • Integrate new features from core distributed computing and ML frameworks into the platform, offering them to production users and providing support.
  • Enhance the platform's scalability, performance, and observability through improved monitoring and logging.
  • Drive the architectural evolution of the platform by adopting modern cloud-native technologies to improve system performance, efficiency, and scalability.
  • Reduce DevOps effort by automating and streamlining operational processes.
  • Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.


Minimum Qualifications

  • Bachelor's degree in Computer Science, engineering, or a related field
  • 6 years of hands-on experience in building scalable backend systems for training and evaluation of machine learning models
  • Proficient in relevant programming languages like Python or Go
  • Strong expertise in distributed systems, reliability and scalability, containerization, and cloud platforms
  • Proficient in cloud computing infrastructure and tools: Kubernetes, Ray, PySpark
  • Ability to clearly and concisely communicate technical and architectural problems while working with partners to iteratively find solutions


Preferred Qualifications

  • Advanced degree in Computer Science, engineering, or a related field.
  • Hands-on experience with cloud-native resource management and scheduling tools like Apache YuniKorn.
  • Experience with advanced architecture for distributed data processing and ML workloads.
  • Proficient in working with and debugging accelerators such as GPUs, TPUs, and AWS Trainium.

Required Experience:

Staff IC


Key Skills

  • Jenkins
  • Ruby
  • Python
  • Active Directory
  • Cloud
  • PowerShell
  • Windows
  • AWS
  • Linux
  • SAN
  • Java
  • Troubleshoot
  • Backup
  • Puppet
  • hardware

About Company


Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...
