About Graphcore
At Graphcore, we're building the future of AI. We are a team of semiconductor, software, and AI experts with deep experience in creating the complete AI compute stack, from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products, and the future of artificial intelligence.
Job Summary
Responsible for system-level reliability of AI servers with liquid cooling and HVDC architectures, owning reliability validation, shock & vibration robustness, and failure analysis from board to rack level to ensure safe transport, deployment, and long-term datacenter operation.
Key Responsibilities and Skills
- Plan and execute reliability validation across board, server, and rack levels.
- Define and run environmental, accelerated, and mechanical tests, including thermal/power cycling, humidity, corrosion, shock & vibration, and HALT/HASS.
- Lead shock & vibration validation for transportation, handling, seismic, and operational conditions.
- Assess reliability risks for liquid cooling systems (leakage, fatigue, pump life, corrosion, coolant stability).
- Evaluate HVDC mechanical and electrical robustness (busbars, connectors, power interfaces).
- Perform reliability prediction and life data analysis (Weibull, MTBF).
- Lead cross-functional design reviews and drive risk mitigation.
- Conduct failure analysis and RCA using standard FA methodologies.
- Define and maintain reliability and S&V test specifications (JEDEC, Telcordia GR-63, JESD22, MIL-STD-810, ISTA, ASHRAE, UL, IEC).
- Implement Ongoing Reliability Testing (ORT) for production quality.
- Document results and support customer audits and certifications.
Qualifications
- Bachelor's or Master's degree in Mechanical, Electrical, Reliability, Materials, or related Engineering.
- 10 years of reliability engineering experience in AI servers, datacenter systems, HPC, or complex electronics.
- Hands-on experience with environmental, shock, and vibration testing.
- Strong knowledge of reliability methodologies and statistical analysis.
- Practical experience with liquid cooling and HVDC systems.
- Proven failure analysis and RCA capability.
- Strong communication skills in English; Mandarin a plus.
Preferred Experience
- AI server architecture and large-scale liquid cooling systems.
- FEA/modal analysis and test correlation.
- Knowledge of datacenter, telecom, and transportation standards.
- Reliability certification (e.g., ASQ CRE).
Benefits
In addition to a competitive salary, Graphcore offers a competitive benefits package. We welcome people of different backgrounds and experiences; we're committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interviews and encourage you to talk to us if you require any reasonable adjustments.
Required Experience:
IC