Senior Research Engineer, Training Data Infrastructure in Foundation Models

Apple

Job Location:

Cupertino, CA - USA

Monthly Salary: Not Disclosed

Posted on: 12 hours ago

Vacancies: 1 Vacancy

Job Summary

We build frontier foundation models that power intelligent experiences at Apple. Our team works across the full training lifecycle, including pre-training foundation models and developing mid-training approaches that bridge general capability and task-specific performance. What makes our work distinct is that we're engineering models specifically for Apple silicon, optimized for experiences that are private, personal, and deeply integrated into the OS. We're solving frontier problems in reward modeling to resist reward hacking, handling sparse and delayed rewards in agentic settings, and aligning models reliably across the spectrum from open-ended creative tasks to precise action-taking workflows. If you're drawn to hard problems where the research and the product are inseparable, this is the team.

This position operates at the convergence of Software Engineering and Machine Learning Research. Unlike traditional backend roles, this position requires you to design systems where the outcome is the statistical distribution and quality of the data itself. You will work alongside Research Scientists to transform theoretical observations into concrete, scalable engineering solutions. Your core focus will be the architecture of our Data Acquisition, Processing, and Repository Management systems for Large Model training. You will lead technical efforts to enable active, quality-driven data curation, including filtering, deduplication, synthetic data generation, and data mixing, ensuring our models are trained on the highest-quality information available.

Architect Scalable Ingestion Systems: Design and implement high-throughput distributed systems to ingest petabytes of text and multimodal data from diverse sources, including web crawls and third-party [...]

[...] Optimization: Manage the lifecycle of large-scale datasets across data storage and high-performance file systems. Optimize data formats for efficient random access and sequential scanning during model [...]

Governance & Privacy: Engineer robust data governance and privacy solutions for the training data, in collaboration with compliance and legal teams, to ensure adherence to stringent regulatory [...]

High-Performance Processing Pipelines: Build and maintain distributed data processing workflows using advanced frameworks on cloud infrastructure (e.g., GCP, AWS).

Algorithmic Data Curation: Implement sophisticated data filtering and selection logic to remove low-quality content. Develop semantic deduplication at scale to prevent model memorization and improve training [...]

[...] Removal: Design automated systems to detect and remove benchmark leakage, ensuring that evaluation datasets remain strictly isolated from training [...]

[...] for Scaling Laws: Collaborate with researchers to enable data ablations and scaling experiments. Build tools to support systematic data mixture optimization and empirical data studies.

Education: Bachelor's degree in Computer Science, Electrical Engineering, or [...]

[...] Expertise: 4 years of software engineering experience with a specific focus on Data Infrastructure, Distributed Systems, or AI/ML [...]

[...] Proficiency: Expert fluency in Python and strong competence in systems languages such as [...]

[...] Architecture: Extensive experience architecting solutions on major public cloud platforms (e.g., GCP) to build scalable data systems (e.g., with Apache Beam, GCS).

Performance Engineering: Deep experience profiling and optimizing high-throughput data systems. Demonstrated ability to debug distributed bottlenecks (e.g., stragglers, I/O saturation), optimize data formats, and provide efficient data storage solutions.

Research Collaboration: Experience working within or closely with ML research organizations (e.g., as a Research Engineer), with an ability to translate research results into engineering [...]

[...] Knowledge: Familiarity with the lifecycle of modern LLM training, end-to-end workflows, and underlying system [...]

[...] Data Types: Experience in processing complex data modalities beyond plain text, such as source code repositories, images, video, and audio.

Required Experience:

Senior IC

About Company

Apple

Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...
