Sr. System Development Engineer, Cloud AI/ML and storage server teams

Amazon


Job Location:

Cupertino, CA - USA

Monthly Salary: Not Disclosed
Posted on: 9 hours ago
Vacancies: 1 Vacancy

Job Summary

We are seeking an experienced Systems Development Engineer to lead the development of automation software, diagnostic tooling, and fleet health infrastructure for our server platforms. You will work across multiple teams and organizations to build scalable, reliable systems that keep our storage and accelerated (AI/ML) compute fleet healthy, with a vision toward zero-touch operations where automation detects, diagnoses, and resolves issues without human intervention.

You will be a technical leader solving complex architectural problems that may not be well-defined in advance. You will own your team's systems, proactively identify deficiencies, and write scalable and robust code to solve issues before they impact customers. You will decompose large, difficult server testability, reliability, and diagnosis problems into straightforward tasks and components, leading delivery yourself and through others in parallel, using a combination of hardware, software, system design, processor architecture, diagnostics, and operations knowledge.

You will collaborate with a variety of roles (SDEs, SDETs, Mechanical/Electrical/Hardware Engineers, TPMs, Managers, Principals) and organizations through server conception, test, validation, qualification, launch, and operations, driving high quality and reliability into current and future designs for AWS server solutions. You will also work closely with ODMs and Design Partners to ensure our tooling, diagnostics, and automation requirements are met throughout the hardware development lifecycle (NPI).

Key job responsibilities
Fleet Health & Predictive Infrastructure
- Build and own the automation infrastructure responsible for the health of the server fleet across storage and accelerator (AI/ML) compute platforms
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations, building automation that detects, diagnoses, triages, and remediates hardware and software faults without human intervention
- Develop monitoring tools, dashboards, and alerting systems to provide real-time visibility into fleet health across lab and production environments
- Define and track fleet health metrics (failure rates, mean time to detect, mean time to repair, first-time fix rate, predictive accuracy)

Debugging & Troubleshooting
- Debug and resolve complex system-level issues across storage, compute, GPU, and networking in production environments
- Troubleshoot Linux boot and runtime failures across x86 and ARM architectures, including PCIe, power, NIC, NVMe, and GPU subsystems
- Perform root cause analysis on hardware failures, correlating across firmware, kernel, driver, and physical layer to isolate faults
- Build diagnostic tooling that automates root cause identification and reduces reliance on manual triage

Systems Development & Automation
- Lead the definition and development of software automation and enabling tools for server hardware programs; track and report progress
- Design and build scalable system-level software with a focus on durability, availability, security, and diagnostics
- Develop and maintain device drivers for Linux on ARM and x86 architectures
- Build automation solutions using modern programming languages (Python, Ruby, Java, C/C++, etc.)
- Work with OS internals, storage subsystems, and accelerator/GPU software stacks in Linux-based environments
- Build, manage, and deploy CI/CD pipelines for rapid deployment of code changes to org-owned and customer-owned systems

Cross-Team Collaboration
- Work across internal HWEng teams to ensure new server hardware addresses data path and control path functionality needed by dependent service teams
- Work closely with internal customers to identify, early on, any potential problems onboarding new servers, storage, or accelerated compute into their ecosystem
- Engage with ODMs and design partners on testability, diagnostic, and automation requirements during hardware design and development
- Contribute to server design to improve robustness, testability, diagnosability, and reliability
- Partner with datacenter operations teams to close the loop between field failures and design improvements

A day in the life
Systems Development Engineers in AWS Hardware Engineering wear many hats. From orchestration tooling development to hardware integration to kernel driver debugging, we dive deep into problems across the breadth of AWS. Our teams are directly responsible for launching and maintaining server hardware in the fleet, including storage servers powering distributed storage platforms and AI/ML accelerator servers with GPUs. Located in Seattle and Cupertino, we work with internal development teams, ODMs, and design partners to deliver servers deployed in datacenters worldwide.

Basic qualifications
- 6 years of non-internship professional software development experience
- 6 years of systems design, software development, operations, automation, and process improvement experience
- 6 years of experience designing or architecting (design patterns, reliability, and scaling) new and existing systems
- 5 years of programming experience with at least one modern language such as C++, C#, Java, Python, Golang, PowerShell, or Ruby
- Experience with Linux/Unix
- Experience leading the design, build, and deployment of complex and performant (reliable and scalable) software solutions in production

Preferred qualifications
- Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
- Experience taking a leading role in building complex software or computing infrastructure that has been successfully delivered to customers
- 7 years of professional experience
- Experience building predictive failure detection or proactive remediation systems at fleet scale
- Experience with Linux kernel driver development
- Experience with storage, compute, and GPU/accelerator platforms (NVIDIA), including driver integration, diagnostics, or performance validation
- Experience with distributed storage systems (block, object, or file)
- Familiarity with server hardware architecture, BMC/IPMI, firmware, PCIe topology, NVLink, and hardware diagnostics
- Experience working with ODMs or hardware design partners through the product development lifecycle
- Experience building zero-touch or self-healing automation for large-scale infrastructure
- Experience working in large-scale datacenter or cloud environments
- Track record of rapidly coming up to speed on new engineering disciplines and making impactful decisions
- Experience with hardware bring-up, validation, and fleet-wide deployment
- Familiarity with telemetry pipelines, anomaly detection, and operational metrics at scale

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Los Angeles County applicants: Job duties for this position include: work safely and cooperatively with other employees, supervisors, and staff; adhere to standards of excellence despite stressful conditions; communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service; and follow all federal, state, and local laws and Company policies. Criminal history may have a direct, adverse, and negative relationship with some of the material job duties of this position. These include the duties and responsibilities listed above, as well as the abilities to adhere to company policies, exercise sound judgment, effectively manage stress and work safely and respectfully with others, exhibit trustworthiness and professionalism, and safeguard business operations and the Company's reputation. Pursuant to the Los Angeles County Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information. If the country/region you're applying in isn't listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at

USA, CA, Cupertino - 173,900.00 - 235,200.00 USD annually
USA, WA, Seattle - 151,200.00 - 204,600.00 USD annually


Required Experience:

Senior IC


About Company


Free shipping on millions of items. Get the best of Shopping and Entertainment with Prime. Enjoy low prices and great deals on the largest selection of everyday essentials and other products, including fashion, home, beauty, electronics, Alexa Devices, sporting goods, toys, automotive ...
