Senior Machine Learning Engineer, Research Optimisation
Job Summary
At Canva, our mission is to empower the world to design. To get cutting-edge research into the hands of millions of users faster, we're looking for a Machine Learning Engineer focused on research enablement and performance: turning promising experiments into stable, scalable, user-facing capabilities while making training and inference faster, cheaper, and more reliable.
About the role:
You'll be the bridge between research and production. Partnering closely with researchers, you'll ensure experimental code is production-ready, integrate models into our monorepo, build shared libraries and services, and create the tooling and processes that let multiple model variants ship safely and quickly. You'll also work across the training stack: profiling and tuning PyTorch workloads, improving GPU utilisation, and shaping how we use distributed training and storage to get the most out of our compute. Your work shortens the research-to-user loop, reduces duplication, and ensures our ML features are reliable, observable, and easy for other teams to adopt.
At the moment, this role is focused on:
Research-to-Production Pipeline: Hardening experimental models (containerisation, tests, CI/CD) and making them deployable for real users.
Training Performance and GPU Efficiency: Profiling PyTorch training jobs, improving GPU utilisation, and applying techniques like mixed precision, efficient data loading, and distributed training strategies (FSDP, DDP, DeepSpeed) to reduce time and cost per experiment.
Library Development: Converting experiments into well-factored libraries with clear APIs, dependency hygiene, and versioning, so teams can import rather than copy-paste.
Developer Experience & Documentation: Creating templates, examples, and guidance; offering supportive, high-signal communication so others can adopt libraries confidently.
Reliability, Observability & Cost: Instrumenting services with metrics/logging/tracing, setting SLIs/SLOs, and optimising training and inference performance and spend.
Primary Responsibilities:
Productionise research models: refactor, test, containerise, and integrate them into the monorepo for scalable reuse.
Profile and optimise PyTorch training jobs, working with researchers to identify bottlenecks across compute, memory, I/O, and networking.
Improve distributed training setups (multi-GPU, multi-node) and help teams pick the right parallelism strategy for their workload.
Build and maintain inference services, SDKs, and shared libraries that standardise pre/post-processing and interfaces across variants.
Create multi-variant runners and rollout frameworks (feature flags, canaries, A/B testing, automated rollbacks).
Establish CI/CD workflows, artifact management, and reproducible builds for ML services and model assets.
Add robust observability (dashboards, alerts) and reliability practices (load tests, chaos/resiliency checks) across training and inference workloads.
Optimise inference (batching, caching, quantisation/compilation, hardware utilisation) to reduce latency and cost.
Work across the broader training stack, including Kubernetes orchestration, storage (e.g. Weka, Vast, Lustre), and data pipelines, to remove friction for research teams.
Partner with researchers and product engineers via code reviews, pair sessions, and clear documentation to accelerate adoption.
Drive good engineering hygiene in the research codebase: testing strategy, dependency management, and de-duplication across multiple model variants.
You're probably a match if you:
Have strong software engineering fundamentals and excellent Python skills; you're comfortable turning notebooks and prototypes into production-grade services.
Have shipped ML systems in production (containers, APIs, CI/CD), ideally within a monorepo environment.
Have hands-on experience optimising PyTorch training or inference, profiling workloads, and reasoning about GPU memory, compute, and throughput.
Are comfortable in containerised environments and understand Kubernetes concepts well enough to debug and improve ML workloads running on it.
Can read research code and refactor it into clean abstractions with stable, well-documented interfaces.
Understand service reliability and observability (metrics, tracing, logging) and how they apply to ML systems.
Think holistically about the stack, from storage and networking through to model code, and can hold a credible conversation with researchers, DevOps, and platform engineers alike.
Communicate clearly and empathetically, especially when guiding others to adopt libraries and best practices and mentoring engineers earlier in their ML journey.
Bring cloud experience (AWS a plus) without needing to be a deep specialist.
Nice to Have:
Familiarity with model-serving/optimisation tooling (e.g. ONNX, TorchScript, Triton, quantisation).
Experience writing or optimising CUDA kernels or using compilation frameworks (Triton, TensorRT) to speed up models.
Experience with distributed training frameworks (FSDP, DDP, DeepSpeed, Megatron) at meaningful scale.
Familiarity with high-performance storage systems (Weka, Vast, Lustre) and the data loading patterns that make or break training throughput.
Experience with experimentation platforms (feature flags, A/B testing) and safe rollout strategies.
Background with multimodal/image generation stacks or LLM-adjacent tooling (not the core focus, but helpful).
Knowledge of MLOps practices (model registries, artifact stores, dependency/version management).
Impact you'll have:
You'll dramatically reduce the time it takes to move from a successful experiment to a reliable, observable feature in production. You'll eliminate copy-paste, unify interfaces, enable parallel variants, and build the shared foundations that let Canva ship ML innovation at scale. You'll also help our research teams get more out of every GPU hour, making training faster and inference cheaper as we scale up the work CORE is doing.
Additional Information:
What's in it for you
Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity, and fun woven throughout life at Canva, too. We also offer a range of benefits to set you up for every success in and outside of work.
Here's a taste of what's on offer:
- Equity packages - we want our success to be yours too
- Inclusive parental leave policy that supports all parents & carers
- An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more
- Flexible leave options that empower you to be a force for good, take time to recharge, and support you personally
Check out for more info.
Other stuff to know
We make hiring decisions based on your experience, skills, and passion, as well as how you can enhance Canva and our culture. We see AI as a powerful amplifier of creativity and technology at Canva. We're evolving how we assess AI skills in our Technology hiring experience - you'll tackle interactive, real-time challenges that reflect the kind of work we do. In some interviews, you may also be asked to solve a problem using an AI tool to show how you approach challenges with tech by your side.
Please note that interviews are conducted virtually.
Remote Work:
No
Employment Type:
Full-time
About Company
We're a global online visual communications platform on a mission to empower the world to design. Featuring a simple drag-and-drop user interface and a vast range of templates ranging from presentations, documents, websites, social media graphics, posters, apparel to videos, plus a hu ...