Research Engineer – Evals

Firecrawl


Job Location:

San Francisco, CA - USA

Salary: $160K–$240K/year
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

Research Engineer – Evals

You'll build the evaluation systems that tell us whether Firecrawl actually works. That sounds simple. It isn't. Our core promise, reliably converting any URL into clean, structured, LLM-ready data, is hard to measure rigorously across millions of different websites, formats, and edge cases. As we layer in models and agent workflows, the question "did that work?" gets harder, not easier.

This isn't an eval role where you inherit a framework and run benchmarks. You'll design the metrics, build the pipelines, generate the datasets, and own the feedback loop from output quality back to model and product decisions. If you care about what "good" actually means and have the engineering depth to measure it, this is the role.

Location: San Francisco, CA (Hybrid) or Remote (Americas, UTC-3 to UTC-10)
Employment Type: Full-time
Department: Engineering Team
Compensation: $160K–$240K + 0.01%–0.10% equity

Salary Range: $160,000 to $240,000/year (Range shown is for U.S.-based employees in San Francisco, CA. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)

Equity Range: Up to 0.10%

Location: San Francisco, CA, or Remote (Americas, UTC-3 to UTC-10)

Job Type: Full-Time

Experience: 3 years in ML engineering, applied AI, or data quality, with production systems

Visa: US Citizenship/Visa required for SF; N/A for Remote

About Firecrawl

Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API. In just a year, we've hit millions in ARR and 50k GitHub stars by building the fastest way for developers to get LLM-ready data.

Previously, we built Mendable, one of the first commercially available "chat with your data" applications. We sold to companies like MongoDB, Coinbase, Snapchat, and more. To do this, we spent a surprising amount of time building reliable infrastructure for getting clean data from the web. When we started to see our founding friends rebuilding the same thing, we thought we might be on to something.

Why Firecrawl

  • Technical ownership: Lead critical browser technology and infrastructure

  • Real impact: Directly shape how our browser stack drives our entire product

  • High velocity: Rapid iteration and deployment of your work

  • Small team, big ambition: Collaborate closely with founders, influencing key decisions and future directions

What You'll Do

Build the eval stack from scratch. Design and own the systems that measure whether Firecrawl's outputs are actually good across scrape, crawl, extract, and map. That means defining metrics, building pipelines, curating datasets, and integrating evals into CI/CD so regressions get caught before they ship. You build the infra yourself because you're the one who needs it to work.
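
The CI/CD gating described above can be pictured as a small check run on every change; this is a minimal sketch, and all surface names, baselines, and tolerances here are illustrative assumptions, not Firecrawl's actual tooling:

```python
# Illustrative regression gate: compare the latest eval scores for each
# product surface against stored baselines and flag meaningful drops.
BASELINES = {"scrape": 0.92, "crawl": 0.90, "extract": 0.88, "map": 0.91}
TOLERANCE = 0.02  # how far a score may drop before the build fails


def find_regressions(scores: dict) -> list:
    """Return surfaces whose score fell more than TOLERANCE below baseline."""
    return sorted(
        surface
        for surface, baseline in BASELINES.items()
        if scores.get(surface, 0.0) < baseline - TOLERANCE
    )


# A CI step would run the eval suite, collect per-surface scores like these,
# and fail the pipeline whenever the returned list is non-empty:
regressions = find_regressions(
    {"scrape": 0.93, "crawl": 0.91, "extract": 0.84, "map": 0.91}
)
```

The point is less the arithmetic than the placement: because the check runs on every change, a quality drop surfaces as a failed build rather than a customer report.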

Design benchmarks that reflect reality. Our outputs need to hold up across millions of websites: SPAs, paywalled content, dynamic rendering, structured and unstructured formats. You'll build benchmark datasets that cover the real distribution of what our customers send us, including the edge cases that break naive approaches. Ground truth doesn't come for free; you'll design the collection and labeling systems too.

Own LLM-as-judge pipelines. You'll design and validate automated judges that score extraction quality at scale, know the failure modes of LLM-based evaluation, and build the human review tooling needed when automation isn't enough. You understand the difference between an eval that measures something real and one that just flatters the system.
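
One way to picture an LLM-as-judge pipeline is a scoring loop over a fixed rubric, with the judge itself pluggable so the same samples can be scored by a model or by a human reviewer. The rubric dimensions and function names below are assumptions for illustration, not Firecrawl's actual system:

```python
from statistics import mean

# Illustrative rubric dimensions for scoring extraction quality.
RUBRIC = ("completeness", "structure_fidelity", "noise_free")


def score_outputs(samples, judge_fn):
    """Score each sample on every rubric dimension (0.0-1.0) and
    aggregate into a single quality number per sample."""
    results = []
    for sample in samples:
        dims = {dim: judge_fn(sample, dim) for dim in RUBRIC}
        results.append(
            {"url": sample["url"], "dims": dims, "score": mean(dims.values())}
        )
    return results


# A toy deterministic stand-in for the judge so the pipeline shape is
# runnable end to end; in practice this would be a model call whose scores
# are validated against human judgments.
def toy_judge(sample, dim):
    return 1.0 if sample["output"].strip() else 0.0


scored = score_outputs(
    [{"url": "https://example.com", "output": "# Title\n\nBody text"}],
    toy_judge,
)
```

Keeping the judge behind a function boundary is what makes validation possible: identical samples can be scored by the automated judge and by human raters, and the two compared.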

Close the loop with models and RL. Evals here aren't a reporting layer; they're a training signal. You'll work closely with the RL and Search/IR research engineers to turn quality measurements into reward signals and feedback loops that make models meaningfully better. Your benchmarks directly influence what gets trained next.


Run fast experiments and communicate clearly. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. When you have findings, anyone on the team can understand what they mean; no decoder ring required.

What We're Looking For

Builds their own eval infrastructure. You don't wait for tooling to appear. You write the pipelines, curate the datasets, design the rubrics, and validate the judges yourself, because you understand that infra choices directly affect what you're actually measuring. You've run evals at scale and debugged the places where they lie.

Knows what "good" means for unstructured web data. You've worked with messy, real-world data before. You understand why markdown quality is hard to define, why structured extraction fidelity varies by schema, and why naive string-match metrics miss the point. You have strong opinions about what a useful benchmark actually looks like, and the rigor to validate them.

Fluent in LLM evaluation methodology. You understand LLM-as-judge systems, their correlation with human judgment, and where they break down. You've designed rubrics that hold up under adversarial inputs, built human review pipelines that scale, and know how to measure inter-rater agreement. You're not fooled by evals that only look good in aggregate.
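
Inter-rater agreement here usually means a chance-corrected statistic rather than raw percent agreement. A minimal sketch using Cohen's kappa for two raters giving pass/fail labels (the labels are made up for illustration):

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's marginal label frequencies.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)


# Two raters agreeing on 5 of 6 pass/fail judgments:
kappa = cohens_kappa(
    ["pass", "pass", "fail", "pass", "fail", "pass"],
    ["pass", "pass", "fail", "fail", "fail", "pass"],
)
```

Raw agreement in this example is 5/6, but kappa comes out to 2/3 once chance agreement is discounted; that gap is exactly why percent agreement alone overstates how reliable a labeling process is.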

Production-minded. You care about whether your evals reflect real production behavior, not just offline benchmarks. You've worked on systems serving real traffic and made hard tradeoffs between evaluation depth, coverage, and cost. A benchmark that doesn't represent what customers actually send isn't a benchmark worth maintaining.

Fast and clear. You'd rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean and what to do next.

Backgrounds that tend to do well: ML engineers who've built eval or data quality systems at AI labs or on applied teams. Engineers who've worked on LLM fine-tuning or RLHF pipelines and understand how feedback quality drives model improvement. People who've worked at the intersection of data infrastructure and model development. Anyone who's been the person on the team asking, "But how do we know this actually works?"

What We're NOT Looking For

Benchmark runners. If your eval experience is running existing frameworks on existing benchmarks and reporting numbers, this isn't the right fit. We need someone who builds the frameworks and defines the benchmarks.

People who treat evals as an afterthought. If your default workflow is to build first and evaluate later, or to treat pass rates as a proxy for actual quality, you'll struggle here. Evals are a first-class product, not a QA gate.

Researchers who need a platform team. If you expect pipelines, datasets, and labeling infrastructure to exist before you can be productive, you'll be frustrated. You build the tools you need.


Slow iterators. If your standard experiment cycle is measured in weeks, not days, you'll struggle with the pace. We need someone who can design, run, and interpret a meaningful experiment within a day or two.

Bonus Points

  • Any other niche expertise and skills

  • Previous experience at a scraping, automation, or security-focused startup

  • Ex-founder

What it Means to Join Firecrawl

  • High Leverage: Your processes directly amplify our growth.

  • Autonomy: Own your domain; we care about outcomes, not hours.

  • Remote-First Culture: Work at our new SF office while collaborating with our remote team.

  • Growth Opportunity: Early equity and a role that scales with the company.

  • Creative Freedom: Experiment with new channels, formats, and automations. If it works, we run with it.

Benefits & Perks

Available to all employees

  • Salary that makes sense: $160K–$240K/year (U.S.-based), based on impact, not tenure

  • Own a piece: Up to 0.15% equity in what you're helping build

  • Unlimited PTO: Minimum 3 weeks off encouraged; take the time you need to recharge

  • Parental leave: 12 weeks fully paid for moms and dads

  • Wellness stipend: $100/month for the gym, therapy, massages, or whatever keeps you human

  • Learning & Development: Expense up to $150/year toward anything that helps you grow professionally

  • Team offsites: A change of scenery, minus the trust falls

  • Sabbatical: 3 paid months off after 4 years; do something fun and new

Available to US-based full-time employees

  • Full coverage, no red tape: Medical, dental, and vision (100% for employees, 50% for spouse/kids); no weird loopholes, just care that works

  • Life & Disability insurance: Employer-paid short-term disability, long-term disability, and life insurance coverage for life's curveballs

  • Supplemental options: Optional accident, critical illness, hospital indemnity, and voluntary life insurance for extra peace of mind

  • Doctegrity telehealth: Talk to a doctor from your couch

  • 401(k) plan: Retirement might be a ways off, but future-you will thank you

  • Pre-tax benefits: Access to FSAs and commuter benefits to help your wallet out a bit

  • Pet insurance: Because fur babies are family too

Available to SF-based employees

  • SF HQ perks: Snacks, drinks, team lunches, and the occasional burst of chaotic startup energy

Interview Process

  1. Application Review: Send us your stuff and a quick note on why you're excited

  2. Automated Assessment (30 min): An initial automated assessment of your skills and knowledge

  3. Intro Chat (25 min): A quick alignment call with a member of our team

  4. Technical Interview (1 hr): Tackle a small challenge

  5. Interview with Founders (30 min): Culture, vision, and long-term fit

  6. Paid Work Trial (1–2 weeks): Work on something real with us

  7. Decision: We move fast

If you've ever wanted to own a product-critical system and build alongside founders, this is your moment. Apply now and let's talk.


Required Experience:

IC


About Company


The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders. Clean, structured, and ready to reason with.
