We're building the evaluation platform that will serve all of Apple's generative AI and agent systems. Evaluating non-deterministic AI systems is one of the hardest unsolved problems in production ML, and one Apple has to get right at scale. We're building the platform that makes it tractable for every team. This is a hands-on engineering role with a lot of autonomy. You'll write a lot of Python and own meaningful pieces of the platform end-to-end. You'll partner closely with research engineers, model and serving teams, product and feature teams, and the infra and data platform groups this work integrates with.
Build and ship: Take ownership of features and services within the evaluation platform: APIs, SDKs, orchestration components, evaluation runners. You'll have the room to make calls on your own work and the support to deliver it.
[...] ML research: Partner with research engineers to take their prototype code and turn it into reliable services. You'll learn their world quickly and translate research patterns into clean Python that holds up under real [...].
[...] fast, responsibly: You'll get scoped problems with room to figure out the how. We trust you to balance speed with care, to know when something needs a quick prototype and when it needs a design doc, tests, and a careful [...].
[...] as you go: Notice the rough edges and pick them up: the flaky test, the slow build, the confusing API, the runbook that's out of date. We want someone who leaves the codebase a little better every [...].
[...] experience: Help build the SDKs and abstractions that other Apple teams use to evaluate their models and agents. You'll feel the friction of bad ergonomics directly, which puts you in a great position to fix it.
[...] ownership: Your code runs in production. You write the tests, set up the CI, add the metrics, and stay close when something breaks. You don't need to be an SRE, but you take care of what you ship.
4-8 years of software engineering experience building and shipping production Python. You're fluent with FastAPI, Pydantic, and the modern Python ecosystem. You write code that's clean, tested, and easy for the next person to pick up.
[...] mindset. You enjoy shipping. You're comfortable iterating quickly on scoped problems and knowing when to slow down for the parts that need it.
[...] with AI coding tools. You actively use tools like Claude Code (or equivalents) in your day-to-day workflow, including features like skills, slash commands, and agent-style workflows. You have a good intuition for when to lean on them, when to steer them, and how to get high-quality [...].
[...] with the agentic LLM landscape. You stay current on how modern LLM systems work in production: tool use, MCP servers, agent frameworks, context management, multi-step reasoning. You can hold a real conversation about the [...].
Hands-on evaluation experience. You've built evaluations for your own agents or LLM systems, or you've worked with evaluation orchestration frameworks like Inspect, Braintrust, LangSmith, Promptfoo, or equivalents (including internal tooling). You understand what makes an evaluation trustworthy vs. [...].
[...] working knowledge of LLMs in production. You're comfortable with prompt iteration, dataset curation, judge models, and statistical reasoning about non-deterministic outputs. You understand the lifecycle around models even if you haven't trained them.
[...] engineering fundamentals. You understand testing, CI/CD, containerization (Docker), and basic observability. You've shipped services that others depend on and stayed close when they [...].
[...] communicator. You write clear PRs, ask sharp questions, and flag blockers early. You're comfortable disagreeing thoughtfully and changing your mind when the argument is [...].
[...]. When something is broken or unclear, you tend to pick it up rather than wait. You either move it forward or surface it clearly.
Experience working on developer platforms, internal tools, or SDKs
Production experience with LLM/agent systems: building, evaluating, or operating them
Familiarity with job orchestration frameworks (Airflow or similar)
Distributed compute experience (Ray, Dask, or Kubernetes-based job systems)
Experience with experiment tracking or ML lifecycle tooling (Weights & Biases, MLflow, etc.)
Startup or early-stage experience where you wore multiple hats and shipped under constraint
Required Experience:
IC
Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...