At Canva, our mission is to empower the world to design. We're building AI that feels magical and lands real impact for millions of people - helping anyone create with confidence. We're looking for a senior research scientist who lives and breathes reinforcement learning and agentic systems to push the frontier of reasoning, tool use, and reliability - and ship it to users.
About the team
We are a cutting-edge post-training team developing new multimodal agentic systems. We work across multimodal modelling, post-training, and agent design: we explore multimodal agentic architectures, build scalable training and evaluation loops, and partner closely with product and platform teams to turn breakthroughs into delightful product features. We are looking for someone with experience in post-training and reinforcement learning (RL) to join our team.
About the role
You'll drive research directions and play a leading role in hands-on work across the agent stack - from reward design and policy optimization, to planning, memory, and tool orchestration, to dataset construction, post-training, and the development of novel post-training approaches. You'll design tight experiments, iterate quickly, and land trustworthy conclusions. Most importantly, you'll help convert research into reliable, safe, and high-quality product experiences.
What you'll be doing in this role
Develop agent systems (planning, multimodal tool use, retrieval, novel training approaches, modeling ablations) for real tasks in design, vision, and language.
Scale post-training and RL across distributed systems (PyTorch) with efficient data loaders, tracing/telemetry, stable training of mixture-of-experts (MoE) architectures, and reproducible pipelines; profile, debug, and optimize.
Contribute to the research agenda for RL/agentic systems aligned with Canva's product goals; identify high-leverage bets and retire dead ends quickly.
Build reward models and learning loops: RLHF/RLAIF, preference modeling, DPO/IPO-style objectives, offline/online RL, curriculum learning, and credit assignment for multi-step reasoning.
Develop simulation and sandbox tasks that surface failure modes (planning errors, tool-use brittleness, hallucination, unsafe actions) and turn them into measurable targets.
Help align on rigorous evaluation for agents (task success, reliability, latency, safety regressions). Stand up offline suites and online A/B tests; favor simple, controlled experiments that generalize.
Collaborate and ship: work shoulder-to-shoulder with product, design, safety, and platform to land research as reliable features - then iterate.
Share and elevate: mentor teammates, present findings internally, and contribute back to the community when it helps the field and our users.
You're likely a match if you have
Depth in implementing and post-training LLMs/VLMs/Diffusion models, with a track record of shipped research or publications in agents/RL.
Experience modifying and adapting open-source models.
Strong experience with experimental design: tight baselines, clean ablations, reproducibility, and clear data-backed conclusions.
Fluency in Python and PyTorch; you're comfortable in large ML codebases and can profile, debug, and optimize training and inference.
Practical experience building agent loops (planning, tool invocation, retrieval, memory) and evaluating multi-step reasoning quality.
Hands-on experience with policy optimization, reward modeling, and preference learning (e.g. RLHF/RLAIF, DPO/IPO, actor-critic/PPO, offline RL).
Experience with large-scale training (distributed training, experiment tracking, evaluation harnesses) and cloud multimodal tooling.
Experience with RL for MoE architectures.
Nice to have
Experience with video and audio modelling.
Experience with multi-agent settings.
Strength in alignment and safety evaluations, including red-teaming and risk mitigation for tool-using agents.
Contributions to open-source benchmarks or shared evaluation suites for agents.
Additional Information:
What's in it for you
Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity, and fun woven throughout life at Canva too. We also offer a stack of benefits to set you up for every success in and outside of work.
Here's a taste of what's on offer:
- Equity packages - we want our success to be yours too
- Inclusive parental leave policy that supports all parents & carers
- An annual Vibe & Thrive allowance to support your wellbeing, social connection, home office setup & more
- Flexible leave options that empower you to be a force for good, take time to recharge, and support you personally
Check out for more info.
Other stuff to know
We make hiring decisions based on your experience, skills, and passion, as well as how you can enhance Canva and our culture. When you apply, please tell us the pronouns you use and any reasonable adjustments you may need during the interview process.
Please note that interviews are predominantly conducted virtually.
Remote Work:
No
Employment Type:
Full-time