ML Kernel Performance Engineer, Edge AI and Science
Job Summary
Within Edge AI & Science, the AI Platform team builds a compression platform, the first of its kind, enabling 20-100x neural network compression for edge and cloud deployment. As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost.
We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves.
Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.
Key job responsibilities
- Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators
- Analyze and optimize kernel-level performance for compression training workloads, conducting detailed performance analysis with profiling tools to identify and resolve bottlenecks that stretch model training from days to weeks
- Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns
- Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale, lowering the bar from CUDA expert to any engineer working with AI agents
- Maintain and extend the team's training kernels library with clean interfaces, CI, and examples that enable scientists to contribute kernel improvements alongside platform engineers
- Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions that unify software and hardware for both cloud and edge deployment
- Develop inference kernels for cloud deployment (custom backends for quantized models that keep weights packed in memory and reconstruct them on the fly for compute)
- Build and maintain performance regression tests and benchmarking infrastructure that track kernel efficiency as models scale from billions to hundreds of billions of parameters
A day in the life
A scientist files a ticket: QAT training on our large model is 4x slower than expected. You pull up the profiler, identify that a custom quantizer kernel is thrashing shared memory at scale, write a Triton replacement that tiles correctly for the layer shapes at that model size, validate accuracy in the test harness, and push it to the kernels repo. By end of day, the training run that was taking four days now takes one.
You will also build the tooling that makes this workflow repeatable by others. You will participate in design discussions with Applied Scientists, translate their algorithmic ideas into efficient GPU implementations, and work in a startup-like environment where every engineering hour directly accelerates the team's ability to ship compressed models.
About the team
The AI Platform team builds Amazon's neural network compression platform. We compress models using knowledge distillation, network restructuring, and advanced quantization to achieve 20-100x compression while preserving model quality. Our platform packages these techniques into automated pipelines that deploy to both custom edge silicon and GPU-based cloud inference.
As model sizes grow, the proprietary advantage shifts from the science to the software (making it work at hundreds of billions of parameters is the moat). GPU kernel performance is the biggest single lever on training throughput, and we expect AI-assisted development tooling to significantly multiply engineering productivity, meaning a small team with the right harness can operate at the scale of a much larger one.
The ML Kernel Performance Engineer bridges science and platforms: you turn algorithmic innovations into production-grade GPU code that runs at scale. You will work alongside Applied Scientists, compiler engineers, hardware architects, and platform developers in a small, agile team building the next generation of edge AI for Amazon's consumer products.
Qualifications
- 3 years of non-internship professional software development experience
- 2 years of non-internship design or architecture (design patterns, reliability, and scaling) of new and existing systems experience
- Experience with CUDA kernels or ML/low-level kernels, or experience developing and deploying LLMs in production on GPUs, Neuron, TPU, or other AI acceleration hardware
- Experience with programming languages such as Python, Java, or C
- Bachelor's degree in computer science or equivalent
- 3 years of full software development life cycle experience, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
- Proficiency in low-level performance optimization for GPUs
- Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
- Experience developing high-performance libraries for ML or HPC applications
- Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
- Experience implementing custom PyTorch operators (C extensions)
- Experience with parallel programming and optimization techniques
- Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
- Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
- Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
- Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
- Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
- Experience with edge deployment, model compilation, or hardware-aware optimization
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. As a total compensation company, Amazon's package may include other elements such as sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon offers comprehensive benefits including health insurance (medical, dental, vision, prescription, basic life & AD&D insurance), Registered Retirement Savings Plan (RRSP), Deferred Profit Sharing Plan (DPSP), paid time off, and other resources to improve health and well-being. We thank all applicants for their interest; however, only those interviewed will be advised as to hiring status.
CAN, BC, Vancouver - 114,800.00 - 191,800.00 CAD annually
Required Experience:
IC
About Company
Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive ...