Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams

Amazon

Job Location:

Cupertino, CA - USA

Monthly Salary: Not Disclosed
Posted on: 22 hours ago
Vacancies: 1 Vacancy

Job Summary

As a Cloud Hardware Development Engineer, you will be an end-to-end owner of storage and/or accelerator (AI/ML/GPU) server platforms from New Product Introduction (NPI) through fleet health in production. You own the full lifecycle: design, development, qualification, launch, and ongoing operational excellence of servers running at scale in the AWS fleet.

You will work closely with internal customers to understand their technical needs and business goals, leveraging your experience with server design and the knowledge of various teams to architect solutions we deploy at scale. To deliver your products, you will work with an interdisciplinary team of component, firmware, power, mechanical, electrical, test, qualification, and manufacturing engineers and lead our ODM (design and manufacturing partners) to bring these servers to the data center. After launch, you own the fleet: monitoring quality, driving reliability improvements, and ensuring servers continue to meet customer requirements throughout their operational life.

This role demands deep technical curiosity and the willingness to jump in and personally solve the hardest problems. When a complex system failure occurs, whether during NPI qualification or in a production fleet of hundreds of thousands of servers, you roll up your sleeves, dive into the details across hardware, firmware, software, and physical layers, and drive to root cause. You don't wait for someone else to figure it out.

You will own end-to-end system reliability, proactively identifying deficiencies and driving toward zero-touch operations where automation detects, diagnoses, and resolves issues before customer impact. You will decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features, leading delivery yourself and through others in parallel.

This is a fast-paced, intellectually challenging position. You'll work with thought leaders in multiple technology areas, hold high standards for yourself and everyone you work with, and constantly look for ways to improve your products' performance, quality, and cost. We're changing an industry, and we want individuals who are ready for this challenge and want to reach beyond what is possible today.


Key job responsibilities
NPI (New Product Introduction)
- Own the end-to-end NPI lifecycle for storage and/or accelerator (AI/ML/GPU) server platforms, from architecture definition through design, qualification, manufacturing ramp, and launch
- Lead technical solutions for complex server and rack system architectural challenges
- Work with ODM/manufacturing partners to develop, validate, and manufacture server products at scale
- Develop functional specifications, design verification plans, and test procedures
- Drive qualification and readiness milestones, ensuring new platforms meet performance, reliability, and cost targets before fleet deployment
- Identify and resolve technical risks early in the development cycle; don't let problems reach production

Fleet Health, Diagnostics & Automation
- Own fleet health for the server platforms you launch; reliability doesn't end at ship
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations: help build detection, diagnosis, and remediation of faults without human intervention
- Debug complex system failures in time-sensitive settings, personally diving deep when the problem demands it
- Perform root cause analysis, correlating across firmware, kernel, driver, thermal, power, and physical layers

Systems Design & Technical Depth
- Apply expertise across hardware, software, system design, x86 architecture, processes, and operations (compute, storage, network, GPU)
- Design and implement solutions to address system-level issues at large scale
- Decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features
- Collaborate with hardware, software, manufacturing, supply chain, and product management teams

Cross-Team Collaboration
- Work closely with internal customers to ensure new server hardware meets data path and control path requirements
- Identify early any potential problems onboarding new servers into customer ecosystems
- Collaborate across Hardware Engineering, component, firmware, test, qualification, and integration teams
- Partner with datacenter operations to close the loop between field failures and design improvements

A day in the life
Your day-to-day responsibilities include interfacing with internal and external customers to understand product requirements and facilitate system development on top of your server designs. You will learn the operational challenges facing our existing fleet, with the goal of improving the current customer experience and developing improved systems for future designs. You will work directly with vendors and ODM (manufacturing partners) to scale your product. Some days you're reviewing a new platform design with your ODM; other days you're deep in logs and telemetry data, chasing a failure mode across the fleet. You thrive on that range.

Basic qualifications
- Experience in developing functional specifications, design verification plans, and functional test procedures
- Bachelor's degree or above in electrical engineering, computer engineering, or equivalent
- Experience communicating in English, both in writing and verbally
- Experience with design & innovation and research & development
- Knowledge of operating systems, hardware, storage, network, security, database administration, and cloud infrastructure
- Experience in server technologies such as thermal, mechanical, power, and signal integrity
- 5 years of professional (non-internship) work experience

Preferred qualifications
- 5 years of experience in hardware design and validation of components, subsystems, and systems
- Experience in server technologies: board design, high-speed bus design and signal integrity, failure analysis, server components (CPU, GPU, SSDs, memory), BIOS, BMC, and networking
- Experience developing and executing test procedures for mechanical or electrical systems/components
- Experience working with ODMs/manufacturers through the product development and manufacturing lifecycle
- Experience building predictive failure detection or proactive remediation systems at fleet scale
- Experience with storage/compute/GPU/accelerator platforms, including integration, diagnostics, or performance validation
- Familiarity with PCIe topology, NVLink, NVMe, and accelerator interconnects
- Experience with large-scale datacenter or cloud environments

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company's reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits, including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at

USA, CA, Cupertino - 157,300.00 - 212,800.00 USD annually
USA, WA, Seattle - 136,000.00 - 184,000.00 USD annually


Required Experience:

IC

About Company

Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive ...