Training builds capability. Post-training decides what it becomes.
This team is rethinking how large multimodal models learn after pre-training, developing post-training and reinforcement learning methods that help models reason, plan, and interact in real time.
Founded by the researchers behind several of the most influential modern AI architectures, this lab is pushing alignment and learning efficiency beyond standard RLHF. They're scaling preference-based training (RLHF, DPO, hybrid feedback loops) to new model types and creating systems that learn from interaction rather than static data.
You'll work at the intersection of post-training, RL, and model architecture, designing reward models, scalable evaluation frameworks, and training strategies that make large-scale learning measurable and reliable. It's applied research with direct impact, supported by serious compute and a tight researcher-to-GPU ratio.
You'll bring experience in large-scale post-training or reinforcement learning (RLHF, DPO, or SFT pipelines), a solid grasp of LLM or multimodal training systems, and the curiosity to explore new optimisation and alignment methods. A publication record at top venues (NeurIPS, ICLR, ICML, CVPR, ACL) is a plus, but impact matters more than titles.
The team is based in San Francisco and works mostly in person. Total compensation is $1 million, with a base salary circa $300K–$600K (negotiable) plus stock and bonus; the exact package depends on experience.
If you want to work where post-training meets architecture, shaping how foundation models learn, reason, and adapt, this is that opportunity.
All applicants will receive a response.
Required Experience:
IC (individual contributor)