Senior Machine Learning Engineer, On-Device & Mobile AI Optimization

Unity Technologies

Job Location:

San Francisco, CA - USA

Monthly Salary: Not Disclosed

Posted on: 21 hours ago

Vacancies: 1 Vacancy

Job Summary

The opportunity
We are building the next generation of AI-driven game experiences, running generative models on-device, right where the players are: on phones, tablets, laptops, and desktops. Our games run inside a modern browser-native runtime (built on technologies such as WebGPU and WebNN), so the models that power these experiences must be deployed and accelerated entirely within that runtime. As a Senior Machine Learning Engineer for On-Device & Mobile AI, you will take state-of-the-art multi-modal models (transformers, diffusion networks, and vision-language models, or VLMs) and make them run fast, small, and reliably on mobile and constrained hardware.

This is a deeply hands-on role. You will own the optimization and deployment of significant parts of the inference stack, from a trained checkpoint leaving research, through export, quantization, and kernel-level tuning, to a shipped feature running inside the engine at interactive frame rates within a fixed memory and power budget. Your work directly shapes the latency, quality, memory footprint, and battery profile of AI features experienced by billions of players.

This role is for an engineer who is energized by the gap between a research model and a shipping on-device product. If you enjoy profilers, frame captures, op-fusion, and shaving milliseconds and megabytes, this is your role.

What you'll be doing

Inference & On-Device Optimization
Own the optimization pipeline for the models you ship: model export, graph transformation, operator fusion, memory-layout planning, and hardware-specific tuning across NPU, mobile GPU, and desktop/laptop GPU.
Apply quantization (INT4/INT8/FP16), weight sharing, structured/unstructured pruning, and knowledge distillation to hit hard latency, memory, and power budgets, and validate them against quality bars.
Do low-level performance work: write and tune WebGPU compute shaders (WGSL) and, where relevant, native kernels (Metal, Vulkan/SPIR-V compute, CUDA); profile with browser and platform tools (Chrome/Dawn GPU traces, PIX, Instruments/Metal System Trace, Snapdragon Profiler, Nsight, RenderDoc) and eliminate bottlenecks at the op and memory-bandwidth level.
Apply efficiency techniques (dynamic resolution, token reduction, cross-frame caching/reuse, reduced-step diffusion samplers) as engineering levers to meet budgets on target SKUs.

Runtime & Systems Integration
Work with WebGPU-targeted inference runtimes (ONNX Runtime Web) alongside native options (CoreML, ONNX Runtime, TFLite, ExecuTorch), and extend or build glue code where off-the-shelf options fall short of our diffusion and VLM workloads.
Build parts of the integration between the ML runtime and the game engine: real-time scheduling, memory pooling, zero-copy buffer sharing between the inference and render paths, and frame-budget management alongside the renderer.
Build supporting engineering for your components: model packaging and asset pipelines, on-device fallbacks and SKU-aware capability tiers, crash/quality telemetry, and automated on-device benchmarking in CI.

Research Productionization
Partner with research scientists to turn novel CV and multi-modal architectures into implementations that are deployable, debuggable, and fast on device.
Provide a feedback loop into research: surface hardware constraints, op-support gaps, and cost models early so model design and deployment converge.
Track breakthroughs in efficient inference (efficient attention, distillation, reduced-step diffusion) and assess them pragmatically: what actually moves latency/memory/power on our target devices.

Collaboration & Engineering Quality
Contribute to engineering best practices, code-review standards, performance-regression gates, and on-device benchmarking methodology.
Support a culture of measurement: track KPIs for latency, quality, memory, and power for the systems you work on across the device matrix.
Partner with platform engineers, product managers, and runtime teams to align your work with device-SKU constraints and product roadmaps.
Share knowledge and mentor junior and mid-level engineers through code review, pairing, and design discussion.

What we're looking for

5 years in software/ML engineering, with meaningful time focused on on-device / edge inference or real-time, performance-critical systems.
Production deployment of transformer- and/or diffusion-based models (e.g. ViT, Stable Diffusion, CLIP/SigLIP-style encoders) on mobile, desktop, or embedded hardware: shipped, not just prototyped.
Hands-on experience with at least one major inference runtime (ONNX Runtime / ORT Web, CoreML, TFLite, ExecuTorch) and a working understanding of operator fusion, memory layout, and runtime scheduling.
Low-level performance engineering: solid command of at least one GPU/compute API (WebGPU/WGSL, Metal, Vulkan, D3D12, or CUDA) and the profiling tools to go with it. You can read a frame capture and a kernel trace and reason about where the time and memory go.
Working knowledge of model-optimization techniques, including quantization (INT4/INT8/FP16), weight sharing, pruning, and distillation, and the judgment to apply them to hit latency and memory budgets. You use them effectively as engineering tools.
Understanding of target hardware: mobile SoCs (Apple Neural Engine, Qualcomm Hexagon/Adreno, ARM Mali) and/or desktop/laptop GPUs (Apple Silicon, NVIDIA, AMD, Intel).
Strong Python for export pipelines and training-side tooling; familiarity with the core languages of a browser-native runtime (TypeScript/JavaScript, WGSL) is a plus.
Working fluency with the models you deploy: enough to read an architecture, modify it for deployment, and reason about accuracy trade-offs.
A collaborative working style: clear communication, reliable delivery, and a willingness to support and learn from teammates.

You might also have

Experience shipping world-model, neural-rendering, or real-time generative pipelines (NeRF, 3DGS, real-time diffusion, or similar) on device.
Hands-on experience deploying models through WebGPU (e.g. ONNX Runtime Web (WebGPU EP) or WebLLM), including writing/tuning WGSL compute shaders.
Game-engine or real-time-graphics background (Unity, Unreal, or a custom engine; Metal/Vulkan/D3D/OpenGL ES render pipelines), especially integrating compute workloads alongside a renderer.
Contributions to open-source ML inference frameworks, runtimes, or GPU/compute libraries, especially in the WebGPU ecosystem (Dawn, wgpu, ORT Web, WebLLM).
Familiarity with compiler stacks (MLIR, TVM, IREE, XLA) for custom kernel generation and graph optimization.
Experience with on-device benchmarking infrastructure, performance-regression CI, and device-farm matrices.
Proficiency in C/Objective-C/Swift for runtime integration.

Additional information

Relocation support is not available for this position
Work visa/immigration sponsorship is not available for this position

Benefits
At Unity, we want our team members to thrive. We offer a wide range of benefits designed to support well-being and work-life balance.

Please note: Benefits eligibility, specific offerings, and coverage vary based on the country and employment status.

While specific benefits vary, here are some of the ways we strive to take care of our eligible team members globally:
Comprehensive health, life, and disability insurance
Commute subsidy
Employee stock ownership
Competitive retirement/pension plans
Generous vacation and personal days
Support for new parents through leave and family-care programs
Office food snacks
Mental Health and Wellbeing programs and support
Employee Resource Groups
Global Employee Assistance Program
Training and development programs
Volunteering and donation matching program

Life at Unity
Unity (NYSE: U) is the world's leading game engine, powering play for more than 3 billion consumers each month. The top mobile games in the world, the most played PC indie titles, the most innovative console games, and virtually all of the top XR and Web Games are developed, deployed, and grown in Unity. Unity also enables teams across industries like automotive, manufacturing, and healthcare to design, simulate, and collaborate in 3D, closing the gap between ideas and reality. For more information, please visit .

Unity is a proud equal opportunity employer. We are committed to fostering an inclusive, innovative environment and celebrate our employees across age, race, color, ancestry, national origin, religion, disability, sex, gender identity or expression, sexual orientation, or any other protected status in accordance with applicable law. Our differences are strengths that enable us to support the growing and evolving needs of our customers, partners, and collaborators. If you have a disability that means there are preparations or accommodations we can make to help ensure you have a comfortable and positive interview experience, please fill out this form to let us know.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

This position requires the incumbent to have a sufficient knowledge of English to have professional verbal and written exchanges in this language since the performance of the duties related to this position requires frequent and regular communication with colleagues and partners located worldwide and whose common language is English.

Headhunters and recruitment agencies may not submit resumes/CVs through this Web site or directly to managers. Unity does not accept unsolicited headhunter and agency resumes. Unity will not pay fees to any third-party agency or company that does not have a signed agreement with Unity.

Your privacy is important to us. Please take a moment to review our Prospect Privacy Policy and Applicant Privacy Policy. Should you have any concerns about your privacy, please contact us at


Required Experience:

Senior IC
