Research Engineer – Benchmarking, Evals & Failure Analysis

Mercor


Job Location:

San Francisco, CA - USA

Monthly Salary: $130 - $500
Posted on: 4 days ago
Vacancies: 1 Vacancy

Department:

Engineering

Job Summary

About Mercor

Mercor is defining the future of work. We partner with leading AI labs and enterprises to provide the human intelligence essential to AI development. Our vast talent network trains frontier AI models in the same way teachers teach students: by sharing knowledge, experience, and context that can't be captured in code alone. Today, more than 30,000 experts in our network collectively earn over $2 million a day.

Mercor is creating a new category of work where expertise powers AI advancement. Achieving this requires an ambitious, fast-paced, and deeply committed team. You'll work alongside researchers, operators, and AI companies at the forefront of shaping the systems that are redefining society. Mercor is a profitable Series C company valued at $10 billion. We work in person five days a week in our San Francisco, NYC, or London offices.

About the Role

As a Research Engineer at Mercor, you'll work at the intersection of engineering and applied AI research. You'll own benchmarking pipelines, evaluation systems, and failure-analysis workflows that directly inform how we train and improve frontier language models. Your work will define how we measure tool use, agentic behavior, and real-world reasoning. You'll design and run evals, build rubrics and scorers, and turn failure analysis into actionable improvements for post-training, RLVR, and data pipelines.

What You'll Do

  • Benchmarking: Design, implement, and maintain benchmarks and metrics for tool use, agentic behavior, and real-world reasoning; ensure benchmarks scale with training and stay aligned with product and research goals.

  • Evaluation systems: Build and operate LLM evaluation systems end to end (runs, scoring, dashboards, and reporting) so researchers and applied AI teams can track model performance and compare runs at scale.

  • Failure analysis: Run systematic failure analysis on model outputs (e.g., wrong tool use, reasoning errors, safety/alignment issues); categorize failure modes, quantify their prevalence, and feed findings into reward design, data curation, and benchmark design.

  • Rubrics and evaluators: Create and refine rubrics, automated evaluators, and scoring frameworks that drive training and evaluation decisions; balance rigor with scalability (human vs. model-as-judge, calibration, agreement).

  • Data quality and usability: Quantify data usability, quality, and impact on key benchmarks; use evals and failure analysis to guide data generation, augmentation, and curation.

  • Cross-team collaboration: Work with AI researchers, applied AI teams, and data producers to align evals with training objectives and to prioritize the benchmarks and failure analyses that matter most.

  • Ownership in a fast-paced environment: Operate in a high-iteration research setting with strong ownership of benchmarks, evals, and failure-analysis workflows.

What We're Looking For

  • Strong applied research background with a focus on model evaluation, benchmarking, and/or failure analysis.

  • Strong coding skills and hands-on experience with ML models and evaluation code.

  • Solid grasp of data structures, algorithms, and backend systems.

  • Comfort with APIs, SQL/NoSQL, and cloud platforms for running and storing eval results.

  • Ability to reason about model behavior, experimental results, and data quality from evals and failure analyses.

  • Excitement to work in person in San Francisco five days a week in a high-intensity, high-ownership environment.

Nice To Have

  • Industry experience on a post-training or evaluation/benchmarking team (highest priority).

  • Publications at top-tier venues (NeurIPS, ICML, ACL), especially in evaluation or benchmarking.

  • Experience building or running LLM evaluations, benchmarks, or failure-analysis pipelines.

  • Experience with synthetic data generation, rubric design, or RL-style workflows that use evals for reward shaping.

  • Work samples or code (e.g., eval frameworks, benchmark suites, failure-analysis reports, or tooling) that demonstrate relevant skills.

Benefits

  • Generous equity grant vesting over 4 years

  • A $10K housing bonus (if you live within 0.5 miles of our office)

  • A $1.5K monthly stipend for meals

  • Free Equinox membership

  • Health insurance



Required Experience:

IC
