ML Accelerator Performance Validation Engineer, Post Silicon Validation

Amazon

Job Location:

Austin, TX - USA

Monthly Salary: Not Disclosed

Posted on: 30+ days ago

Vacancies: 1 Vacancy

Department:

Software Development

Job Summary

Annapurna Labs, an AWS organization with development centers in the U.S. and Israel, builds custom silicon and software for AWS customers. Our team combines cloud-scale innovation with world-class expertise across silicon engineering, hardware design, verification, software, and operations to tackle technical challenges that have never been seen before.

Join our Post-Silicon Validation team to quantify and qualify the performance of AWS's custom ML training chips against architectural targets. You'll bridge the gap between silicon capabilities and real-world ML workload demands, ensuring our accelerators deliver on latency, throughput, and efficiency promises at cloud scale.

You'll work in a fast-paced, startup-like environment alongside some of the brightest minds in the industry on next-generation AI/ML hardware that powers AWS's training and inference infrastructure. Your analysis will directly shape architectural decisions for next-generation accelerators and determine when silicon is ready for production deployment.

Key job responsibilities
Design and execute performance benchmarks spanning micro-architectures to full model training

Measure and analyze compute throughput, memory bandwidth, interconnect latency, and more

Profile real ML workloads (transformer models, LLMs, vision models) on silicon

Identify performance bottlenecks and work with architecture teams on optimization

Build automated performance regression dashboards and tracking infrastructure

Correlate silicon measurements against RTL simulation and emulation predictions

A day in the life
Your primary focus is measuring and understanding how our AI chips perform under real workloads. You'll spend mornings digging into benchmark results, figuring out where cycles are being lost and why throughput isn't hitting targets. When something looks off, you'll instrument the hardware, profile the pipeline, and work with design teams to get it fixed. Some days you'll be developing and running full training models end-to-end; others you'll be building the dashboards that tell leadership whether silicon is ready to ship.

About the team
The MLA Post-Silicon Validation team owns validation of AWS's next-generation ML training accelerators, from first silicon through production deployment in AWS data centers. We sit at the intersection of hardware, firmware, and ML software, ensuring every layer of the stack performs, scales, and meets the quality bar. Our team culture values deep technical ownership, data-driven decisions, and a bias for action. We operate with startup agility backed by AWS-scale resources, and our work directly enables the cloud computing infrastructure that millions of customers rely on for AI/ML workloads.

Basic qualifications
- 3 years of non-internship professional software development experience
- 2 years of non-internship design or architecture (design patterns, reliability, and scaling) of new and existing systems experience
- Experience with Machine Learning and Large Language Model fundamentals, including architecture, training/inference lifecycles, and optimization of model execution, or experience working with PyTorch or JAX software
- Bachelor's degree in computer science, engineering, mathematics, or equivalent, or experience in Java, C, Python, or a related language
- 3 years of experience with hardware performance counters and profiling tools for analyzing and optimizing system and application performance
- Strong understanding of computer architecture fundamentals, including memory hierarchies (caches, DRAM, HBM), compute pipelines, and interconnect topologies
- Experience applying statistical methods, regression analysis, and data visualization techniques to interpret performance data and drive optimization decisions

Preferred qualifications
- 3 years of full software development life cycle experience, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU, or other AI acceleration hardware
- Knowledge of collective communications (AllReduce, AllGather) and scaling
- Experience with HBM, PCIe, and/or DMA bandwidth characterization

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at

Austin, TX: 143,700.00 - 194,400.00 USD annually
