Research Engineer — Reinforcement Learning
San Francisco, CA - USA
Job Summary
You'll bring reinforcement learning to Firecrawl's core product, building the training infrastructure, reward pipelines, and fine-tuning systems that make our models meaningfully better at extracting, understanding, and structuring web data. This isn't theoretical RL research. You'll build your own training infra, run fast experiments, ship models to production, and bridge the gap between classical RL approaches and modern LLM agent systems. If you care as much about training throughput as you do about reward design, this is the role.
Salary Range: $180,000-$290,000/year (Range shown is for U.S.-based employees. Compensation outside the U.S. is adjusted fairly based on your country's cost of living.)
Equity Range: Up to 0.15%
Location: San Francisco, CA or Remote (Americas, UTC-3 to UTC-10)
Job Type: Full-Time
Experience: 3 years in applied RL, ML engineering, or model training with production systems
Visa: US Citizenship/Visa required for SF; N/A for Remote
About Firecrawl
Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API. In just a year, we've hit 8 figures in ARR and 100k GitHub stars by building the fastest way for developers to get LLM-ready data.
We're a small, fast-moving, technical team building essential infrastructure that superintelligence will use to gather data on the web. We ship fast and deep.
What You'll Do
Build training infrastructure and reward pipelines from scratch. Design and operate the systems that train and evaluate Firecrawl's models. You'll own the full loop: data collection, reward modeling, training runs, evaluation, and deployment. You build the infra yourself because you're the one who needs it to work.
Fine-tune models to achieve state-of-the-art results. Take foundation models and make them dramatically better at web data extraction, content understanding, and structured output generation. You know how to get from a decent fine-tune to best-in-class, and you have the patience and rigor to close that gap.
Bridge LLM agents and classical RL. The most interesting problems at Firecrawl sit at the intersection of modern LLM-based agents and classical RL techniques. You'll design reward signals for agent behaviors, apply RL methods to improve multi-step agent workflows, and figure out where traditional RL approaches outperform prompting and vice versa.
Run fast experiments and iterate. You design experiments that test meaningful hypotheses, run them quickly, and make decisions based on results. You don't spend weeks on experiment infrastructure before getting a single result. Speed of iteration is a core part of how you work.
Communicate clearly to non-RL people. RL can be opaque. You translate your work into language that engineers, product people, and leadership can understand and act on. You know how to explain why a reward function matters without requiring everyone to read the paper.
Collaborate closely with the team. Work directly with the Search/IR-focused Research Engineer and the engineering team to connect RL improvements with search ranking and the broader product roadmap.
What We're Looking For
Builds their own training infra and reward pipelines. You don't wait for an ML platform team to set things up. You build the training loops, reward models, data pipelines, and evaluation frameworks yourself because you understand that infra choices directly affect the quality of results. You've operated GPU clusters, managed training runs, and debugged convergence issues in production.
Can fine-tune models to SOTA. You've taken models from baseline to best-in-class on tasks that matter. You understand the full fine-tuning lifecycle (data curation, training dynamics, hyperparameter sensitivity, evaluation methodology), and you have the taste to know when a model is actually good versus when the eval is flattering.
Bridges LLM agents and classical RL. You're fluent in both worlds. You understand PPO, RLHF, reward modeling, and policy optimization, and you understand how modern LLM agents work, where they fail, and how RL techniques make them better. You see connections between these domains that most people miss.
Production-minded. You care about whether your models work in production, not just on benchmarks. You've deployed models that serve real traffic and made hard tradeoffs between model quality, latency, and cost. Research that doesn't ship isn't research that matters here.
Runs fast experiments and communicates clearly. You'd rather run three rough experiments this week than one polished one next month. When you have results, anyone on the team can understand what they mean, no decoder ring required.
Backgrounds that tend to do well: RL engineers at AI labs or applied ML teams who've shipped models to production. Researchers who've done RLHF or reward modeling for LLM systems. ML engineers who've built training infrastructure at startups and cared as much about the pipeline as the model. People who've worked at the intersection of RL and language models, whether in academic labs with a production bent or at companies building agent systems.
What We're NOT Looking For
Pure theorists. If your best RL work lives in a paper and you've never trained a model on real data at real scale, this isn't the role. We need someone who builds and ships.
Researchers who need a platform team. If you expect training infrastructure, data pipelines, and evaluation frameworks to be set up before you can be productive, you'll be frustrated here. You build the tools you need.
People who only know one paradigm. Deep in classical RL but never worked with LLMs? LLM fine-tuner who's never touched RL? You'll be missing half the picture. This role requires fluency in both.
Slow iterators. If your standard experiment cycle is measured in weeks, not days, you'll struggle with the pace. We need someone who can run a meaningful experiment, interpret results, and decide next steps within a day or two.
Black-box communicators. If your typical update is a wall of metrics only another RL researcher can parse, this isn't the right fit. We need someone who can explain what's working, what's not, and why it matters to people without RL PhDs.
A Note On Pace
We operate at an absurd level of urgency because the window for what we're building won't stay open forever. If that excites you, keep reading. If it doesn't, no hard feelings, but this role probably isn't for you.
Benefits & Perks
Available to all employees
Salary that makes sense: $180,000-$290,000/year, based on impact, not tenure
Own a piece: Up to 0.15% equity in what you're helping build
Generous PTO: 15 days mandatory; anything after 24 days, just ask (holidays excluded). Take the time you need to recharge
Parental leave: 12 weeks fully paid, for moms and dads
Wellness stipend: $100/month for the gym, therapy, massages, or whatever keeps you human
Learning & Development: Expense up to $1,000/year toward anything that helps you grow professionally
Team offsites: A change of scenery, minus the trust falls
Sabbatical: 3 paid months off after 4 years; do something fun and new
Available to US-based full-time employees
Full coverage, no red tape: Medical, dental, and vision (100% for employees, 50% for spouse/kids); no weird loopholes, just care that works
Life & Disability insurance: Employer-paid short-term disability, long-term disability, and life insurance coverage for life's curveballs
Supplemental options: Optional accident, critical illness, hospital indemnity, and voluntary life insurance for extra peace of mind
Doctegrity telehealth: Talk to a doctor from your couch
401(k) plan: Retirement might be a ways off, but future-you will thank you
Pre-tax benefits: Access to FSAs and commuter benefits (US-only) to help your wallet out a bit
Pet insurance: Because fur babies are family too
Available to SF-based employees
SF HQ perks: Snacks, drinks, team lunches, intense ping pong, and peak startup energy
E-Bike transportation: A loaner electric bike to get you around the city, on us
Interview Process
Application Review - Send us your work and a quick note on why this excites you. Show us what you've trained: models, reward systems, training pipelines. Published work is great; shipped production models are better.
Intro Chat (20 min) - A quick conversation to get to know each other before we go deep. We'll talk about what you've been working on, what drew you to Firecrawl, and what you're looking for in your next role. Time for your questions too.
Technical Deep Dive (60 min) - Go deep on RL and model training work you've done: training infrastructure decisions, reward design, fine-tuning approaches, production deployment. We'll explore a live problem: how you'd apply RL to improve an LLM agent workflow at Firecrawl. We're looking for depth across classical RL and modern LLM techniques, production instincts, and fast reasoning.
Founder Chat (30 min) - Culture, pace, ownership, and how you like to work. Time for your questions too.
Paid Work Trial (1-2 weeks) - Tackle a real RL/fine-tuning problem with production implications. We evaluate on technical depth, experiment velocity, and how clearly you communicate results.
Decision - We move fast after the trial.
If you want to bring RL to one of the most interesting applied problems in AI (making agents smarter at understanding and extracting web data at scale), this is your shot.
Apply now.
Required Experience:
IC
About Company
The web crawling, scraping, and search API for AI. Built for scale. Firecrawl delivers the entire internet to AI agents and builders. Clean, structured, and ready to reason with.