Senior ML Engineer

Apple

Job Location:

Seattle, WA - USA

Monthly Salary: Not Disclosed

Posted on: Yesterday

Vacancies: 1 Vacancy

Job Summary

Imagine what you could do here. At Apple, great ideas have a way of becoming phenomenal products, services, and customer experiences very quickly. Bring passion and dedication to your job and there's no telling what you could accomplish.

Do you love solving complex distributed systems challenges at massive scale? Are you passionate about Kubernetes scheduling, resource management, and building platforms that power the next generation of Machine Learning and Data workloads? Do you thrive in designing and operating highly reliable, large-scale job scheduling and orchestration systems that serve as the backbone of AI and Data infrastructure? If so, join the Apple Data Platform team to design and build a scalable batch and ML infrastructure platform used across Apple.

As part of Apple Data Platform, you will play a meaningful role in designing, developing, and deploying high-performance systems that power batch and ML workloads across Apple's global infrastructure, spanning public clouds and Apple data centers. This enormous scale brings unique and complex challenges in resource scheduling, workload orchestration, and operational excellence that require extraordinarily creative problem-solving.

Apple Batch is a fully managed platform within the Apple Data Platform that supports large-scale batch and ML workloads across Apple data centers and AWS/GCP. It orchestrates containerized workloads such as Spark, Ray, and LLM batch inference, using YuniKorn/Kueue for advanced multi-cluster scheduling. The platform delivers org/team quota management, automatic node repair, end-to-end observability, strong security, and granular cost ...

As part of the Apple Batch team, you will have a meaningful role in designing, developing, and deploying high-performance systems that power large-scale batch processing and ML workloads daily. We are building critical infrastructure that provides scalable batch execution, intelligent Kubernetes-native job scheduling, multi-tenant resource management, and efficient workload orchestration for ML training, inference, and data processing workloads across multi-cloud and on-premises environments.

We are looking for a strong, enthusiastic engineer with deep expertise in Kubernetes scheduling and distributed systems. You will have significant individual responsibility and influence over critical platform services. You are someone with ideas and a real passion for building infrastructure that improves reliability, efficiency, and simplicity at Apple scale.

- Design, build, and deploy highly reliable, large-scale distributed systems for batch processing and ML infrastructure across public clouds and Apple data centers using Go, Java, or Python
- Architect and operate Kubernetes-native scheduling systems such as Kueue and YuniKorn, building custom operators and CRDs to manage complex ML and data workloads
- Implement advanced scheduling strategies, including gang scheduling, topology-aware routing, bin-packing, and fair-share queuing, to maximize GPU efficiency and hardware utilization
- Build and manage secure multi-tenant Kubernetes environments with strict resource isolation, quota governance, and priority-based preemption
- Drive end-to-end observability, monitoring, and incident response practices to ensure high availability and fault tolerance of production systems
- Collaborate with ML researchers, data engineers, SRE, and product teams to integrate scheduling solutions into Apple's broader AI and data platform ecosystem
- Contribute to platform adoption by guiding internal customers, gathering requirements, and delivering impactful platform capabilities

- 5 years of experience designing, developing, and operating highly available, large-scale distributed systems and data or ML infrastructure
- Strong software engineering skills with deep programming expertise in Go, Java, or Python
- Advanced knowledge of Kubernetes internals, including custom controllers, scheduler architecture, resource quotas, and workload lifecycle management
- Hands-on experience with Kubernetes-native batch scheduling frameworks such as Kueue or YuniKorn, and advanced scheduling concepts like gang scheduling, bin-packing, and priority preemption
- Experience with cloud-native infrastructure across multi-cloud environments, including AWS, GCP, and on-premises systems
- Strong commitment to operational excellence, system observability, and continuous improvement for mission-critical services

- GPU scheduling, accelerator-aware placement, and optimization for large-scale AI/ML workloads
- Experience with distributed data and ML frameworks such as Apache Spark, Ray, PyTorch, JAX, or Flink at scale
- Experience contributing to open-source projects in Kubernetes scheduling, container technologies, or ML infrastructure ecosystems such as Apache YuniKorn, Kueue, or similar systems
- Experience using GenAI technologies to improve developer productivity, streamline engineering processes, and accelerate team execution

Required Experience:

Senior IC

About Company

Apple

Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...
