ML Safety Engineer

Apple

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed

Posted on: Yesterday

Vacancies: 1 Vacancy

Job Summary

Apple Services Engineering (ASE) powers many AI features across App Store, Music, Video, and more. We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating systemic biases and maintain safe and trustworthy experiences across our AI tools and models.

Our team, part of Apple Services Engineering, is looking for an ML Research Engineer to lead the design and continuous development of automated safety benchmarking. In this role, you will investigate how media-related agents behave, develop rigorous evaluation frameworks and techniques, and establish scientific standards for assessing the risks they pose and their safety performance. This role supports the development of scalable evaluation techniques that ensure our engineers have the right tools to assess candidate models and product features for responsible and safe performance.

The capabilities you build will allow for the generation of benchmark datasets and evaluation methodologies for model and application outputs at scale, enabling engineering teams to translate safety insights into actionable engineering and product improvements. This role blends deep technical expertise with strong analytical judgment to develop tools and capabilities for assessing and improving the behavior of advanced AI/ML models. You will work cross-functionally with Engineering and Project Managers, Product, and Governance teams to develop a suite of technologies to ensure that AI experiences are reliable, safe, and aligned with human values.

The successful candidate will take a proactive approach to working independently and collaboratively on a wide range of projects. In this role, you will work alongside a small but impactful team, collaborating with ML and data scientists, software developers, project managers, and other teams at Apple to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks.

Design scientifically grounded benchmarking methodologies covering multiple dimensions of responsibility and safety across several media and application marketplace use cases
Develop automated evaluation pipelines that collect, automatically judge, and analyze model outputs with respect to safety policies at scale
Create and curate datasets, tasks, and feature usage scenarios that represent realistic and adversarial use cases across multiple languages, markets, and domains
Define and validate new metrics for complex phenomena such as multi-turn agentic interaction patterns
Apply statistical rigor and reproducibility to the above-mentioned objectives
Work closely with engineering and research teams to translate experimental findings into actionable model improvements and safety mitigations
Publish internal reports and external papers
Monitor evolving industry practices and academic work to ensure benchmarks remain relevant

Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience
1 years of work experience either as a postdoc or in the industry
Strong research background in empirical evaluation, experimental design, or benchmarking
Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
Deep familiarity with software engineering workflows and developer tools
Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems
Strong analytical and communication skills, including the ability to write clear reports

Technical Skills:
Proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
Experience working with large datasets, annotation tools, and model evaluation pipelines
Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns
Ability to design taxonomies, categorization schemes, and structured labeling frameworks
Analytical Strength: Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights
Communication: Strong ability to stitch together qualitative and quantitative insights into actionable guidance; strong ability to communicate complex architectures and systems to a variety of stakeholders
Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field

Publications in AI/ML evaluation or related fields
Experience with automated testing frameworks
Experience constructing human-in-the-loop or multi-turn evaluation setups
Intermediate or advanced proficiency in Swift
Familiarity with RAG systems, reinforcement learning, agentic architectures, and model fine-tuning
Expertise in designing annotation guidelines and validation instruments and techniques
Background in human factors, social science, and/or safety assessment methodologies


About Company

Apple

Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...
