Research Engineer – Benchmarking, Evals & Failure Analysis
San Francisco, CA - USA
Job Summary
Location: San Francisco
Company Stage: Late-Stage / Series C (AI / Applied ML)
Office Type: Onsite (5 Days a Week)
Salary: $130,000 to $400,000, plus equity
This fast-growing AI company operates at the forefront of applied machine learning and labor transformation. By partnering with leading AI labs and enterprises, they are building systems that combine human expertise with cutting-edge AI to improve model performance and unlock new categories of work. With strong revenue scale and backing from top-tier investors, they are shaping how frontier models are trained, evaluated, and deployed in real-world environments.
What You Will Do
- Design and implement benchmarking systems to evaluate model capabilities such as tool use, reasoning, and agent behavior
- Build and operate end-to-end evaluation pipelines, including scoring systems, dashboards, and reporting infrastructure
- Conduct systematic failure analysis on model outputs, identifying key failure modes and translating them into actionable improvements
- Develop rubrics, evaluators, and scoring frameworks that balance rigor with scalability (human and automated evaluation)
- Partner with research and applied AI teams to align evaluation systems with training and product goals
- Analyze data quality and performance trends to inform model training, data generation, and post-training strategies
- Own evaluation and benchmarking systems in a fast-paced, high-iteration environment
Requirements
- Strong applied AI or ML engineering experience, particularly in model evaluation, benchmarking, or failure analysis
- Hands-on experience building or running LLM evaluation systems, benchmarks, or experimentation pipelines
- Strong coding ability (Python or similar) with experience building production-quality systems
- Solid understanding of data structures, algorithms, and backend systems
- Experience working with APIs, databases (SQL/NoSQL), and cloud infrastructure
- Ability to reason deeply about model behavior, experimental results, and system performance
- Comfortable operating in ambiguous, high-ownership environments with rapid iteration cycles
Preferred Qualifications
- Experience working on post-training, RL, or evaluation teams at AI labs or AI-first companies
- Familiarity with LLM evaluation techniques, benchmarking frameworks, or agent evaluation systems
- Experience with synthetic data generation, rubric design, or reward modeling workflows
- Publications or research experience in ML, especially in evaluation or benchmarking
- Exposure to large-scale experimentation systems or model performance tracking infrastructure
Benefits
- Competitive base salary ($130K to $400K) plus meaningful equity
- Relocation and housing support available
- Monthly meal stipend and premium wellness perks (e.g., fitness membership)
- Comprehensive health insurance
- Opportunity to work directly with frontier AI labs and influence model development at the cutting edge
This is a high-impact role at the intersection of engineering and applied AI research, ideal for candidates excited about defining how next-generation models are evaluated, improved, and deployed at scale.
Required Experience:
IC
About Company
Recruiting from Scratch places senior software engineers at top AI-native startups and advocates for candidates: 300+ placements, a 29-day average time to hire, and a 90+ NPS.