AI Hardware Systems Manager, Annapurna Labs, Trainium Machine Learning Fleet Operations
Austin, TX - USA
Job Summary
At Annapurna Labs, we are at the forefront of hardware/software co-design, not just in Amazon Web Services (AWS) but across the industry. The Machine Learning Acceleration Fleet Operations Team is looking for a technical leader to manage a team of 5-10 engineers and own operations across multiple ML server platforms spanning tens of thousands of hosts globally.
We are seeking a manager who combines strong technical depth in hardware systems and software development with proven people leadership. You will build and grow a high-performing team, set technical direction for fleet-scale automation and tooling, and drive operational excellence across some of the most advanced server hardware in existence. You will define your team's 6-12 month roadmap, influence org-level priorities, and represent fleet operations in VP-level reviews. You are as comfortable debugging a complex hardware failure as you are coaching an engineer through a career development conversation.
Our team has end-to-end ownership of some of the most advanced server hardware in the world. We drive technical debug efforts and write truly massive-scale autonomous software to monitor, optimize, and remediate machine learning hardware. Come define how we operate the future of ML infrastructure.
Key job responsibilities
- Build, hire, mentor, and grow a team of platform development engineers responsible for ML fleet operations across multiple accelerator platforms
- Define team roadmap and technical strategy for fleet health automation and data infrastructure, balancing near-term operational demands against long-term engineering investments
- Drive operational excellence by establishing metrics, SLAs, and processes that maximize platform sellability and customer experience
- Partner with hardware engineering, software engineering, and product teams to prioritize debug efforts and translate fleet learnings into permanent design fixes
- Own escalation paths for critical fleet incidents and lead cross-functional war rooms to resolution
- Influence org-level priorities by surfacing fleet-wide patterns and advocating for systemic improvements across the ML hardware portfolio
- Raise the bar on team software practices, ensuring automation is maintainable, tested, documented, and reusable at scale
- Represent fleet operations in executive reviews, providing data-driven narratives on platform health and roadmap
A day in the life
As a Manager on the MLA Fleet Operations team, you set the direction for how your team keeps the world's most advanced ML accelerators healthy at scale.
You start each day with your people: holding 1:1s, coaching engineers through ambiguous technical problems, removing blockers, and ensuring the team is focused on the highest-impact work. From there, you review fleet health with the team, understanding which issues are trending, which investigations need unblocking, and where to allocate engineering effort for maximum customer impact. You partner with hardware design teams to advocate for fleet-informed design changes and with service teams to align on deployment schedules. You balance long-term automation investments against near-term operational demands, and you represent your team's work to senior leadership with clear data and crisp narratives. When critical incidents arise, you lead the response, marshaling the right people, driving root cause, and ensuring corrective actions land.
About the team
The MLA Fleet Operations team was formed to maintain an exceptionally high quality bar for our fleet of advanced machine learning accelerators and server products. We perfect the customer experience by developing scalable software for rapid incident response and data visualization, as well as by diving deep into hardware issues as they arise.
- Bachelor's degree in computer science, electrical engineering, or related field
- 2 years of engineering team management experience
- Knowledge of and proficiency in the Python scripting language
- Experience with general troubleshooting/debugging of hardware
- Experience designing, building, operating, and managing large-scale distributed systems or web services
- 7 years of experience in systems engineering, platform engineering, SRE, or hardware operations
- Experience in automating, deploying, and supporting large-scale infrastructure
- Experience in server technologies such as thermal, mechanical, power, and signal integrity
- Experience working cross-functionally across several teams both technical and non-technical
- Experience with GPU, ML accelerator, or high-performance computing hardware
- Experience managing teams through ambiguity on new or unreleased products
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at
TX Austin - 175100.00 - 236900.00 USD annually
Required Experience:
Manager