Software Engineer I, AI/ML, AWS Neuron Distributed Training
Cupertino, CA - USA
Job Summary
In this role you will be responsible for the development, enablement, and performance optimization of large-scale ML model training across diverse model families. This includes massive-scale pre-training and post-training of LLMs with Dense and Mixture-of-Experts architectures, Multimodal models that are transformer and diffusion based, and Reinforcement Learning workloads. You will work at the intersection of ML research and high-performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers, and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium-based systems.
Key job responsibilities
You will contribute to the design and implementation of distributed training solutions for large-scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks, including FSDP, torchtitan, and Hugging Face libraries, for the Neuron ecosystem.
A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to improve training throughput while maintaining model accuracy and convergence quality. This includes implementing precision-aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced-precision formats.
Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. You will partner with hardware, compiler, and runtime teams to understand system constraints and unlock new capabilities. Additionally, you will collaborate with AWS solution architects and customers to support the deployment and optimization of training workloads at scale.
About the team
Annapurna Labs was a startup company acquired by AWS in 2015 and is now fully integrated. If AWS is an infrastructure company, then think of Annapurna Labs as the infrastructure provider of AWS. Our org covers multiple disciplines, including silicon engineering, hardware design and verification, software, and operations. AWS Nitro, ENA, EFA, Graviton and F1 EC2 Instances, AWS Neuron, Inferentia and Trainium ML Accelerators, and storage with scalable NVMe are some of the products we have delivered over the last few years.
Basic qualifications
- Bachelor's degree or above in computer science, computer engineering, or a related field
- 1 year of programming experience with at least one software programming language (including academic projects, internships, or research)
- Experience with software development practices, including code reviews, source control, testing, and build processes
- Experience with machine learning concepts and at least one ML framework (PyTorch, JAX, or TensorFlow)
Preferred qualifications
- Master's degree or above in computer science or equivalent
- Experience with large-scale distributed training or LLM workloads
- Experience with computer architecture or hardware-software co-optimization
- Experience with distributed systems, libraries, or frameworks
- Familiarity with end-to-end model training pipelines
- Previous internship or research experience in ML infrastructure or systems software
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company's reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits, including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at
CA, Cupertino - 127,100.00 - 185,000.00 USD annually
Required Experience:
IC
About Company
Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive ...