Research Engineer – Benchmarking, Evals & Failure Analysis

Recruiting From Scratch

Job Location:

San Francisco, CA - USA

Salary: $130,000 - $400,000

Posted on: 12 hours ago

Vacancies: 1 Vacancy

Job Summary

Research Engineer – Benchmarking, Evals & Failure Analysis

Location: San Francisco
Company Stage: Late-Stage / Series C (AI / Applied ML)
Office Type: Onsite (5 Days a Week)
Salary: $130,000 - $400,000 + Equity

Company Description

This fast-growing AI company operates at the forefront of applied machine learning and labor transformation. By partnering with leading AI labs and enterprises, they are building systems that combine human expertise with cutting-edge AI to improve model performance and unlock new categories of work. With strong revenue scale and backing from top-tier investors, they are shaping how frontier models are trained, evaluated, and deployed in real-world environments.

What You Will Do

Design and implement benchmarking systems to evaluate model capabilities such as tool use, reasoning, and agent behavior
Build and operate end-to-end evaluation pipelines, including scoring systems, dashboards, and reporting infrastructure
Conduct systematic failure analysis on model outputs, identifying key failure modes and translating them into actionable improvements
Develop rubrics, evaluators, and scoring frameworks that balance rigor with scalability (human and automated evaluation)
Partner with research and applied AI teams to align evaluation systems with training and product goals
Analyze data quality and performance trends to inform model training, data generation, and post-training strategies
Own evaluation and benchmarking systems in a fast-paced, high-iteration environment

Ideal Background

Strong applied AI or ML engineering experience, particularly in model evaluation, benchmarking, or failure analysis
Hands-on experience building or running LLM evaluation systems, benchmarks, or experimentation pipelines
Strong coding ability (Python or similar) with experience building production-quality systems
Solid understanding of data structures, algorithms, and backend systems
Experience working with APIs, databases (SQL/NoSQL), and cloud infrastructure
Ability to reason deeply about model behavior, experimental results, and system performance
Comfortable operating in ambiguous, high-ownership environments with rapid iteration cycles

Preferred

Experience working on post-training, RL, or evaluation teams at AI labs or AI-first companies
Familiarity with LLM evaluation techniques, benchmarking frameworks, or agent evaluation systems
Experience with synthetic data generation, rubric design, or reward modeling workflows
Publications or research experience in ML, especially in evaluation or benchmarking
Exposure to large-scale experimentation systems or model performance tracking infrastructure

Compensation and Benefits

Competitive base salary ($130K - $400K) + meaningful equity
Relocation and housing support available
Monthly meal stipend and premium wellness perks (e.g., fitness membership)
Comprehensive health insurance
Opportunity to work directly with frontier AI labs and influence model development at the cutting edge

This is a high-impact role at the intersection of engineering and applied AI research, ideal for candidates excited about defining how next-generation models are evaluated, improved, and deployed at scale.

About Company

Recruiting From Scratch

Senior software engineering jobs at top AI-native startups. Recruiting from Scratch advocates for candidates: 300+ placements, 29-day average time to hire, 90+ NPS.
