We are seeking highly motivated and experienced engineers to join our team. The ideal candidate will have a deep understanding of machine learning systems and cloud computing infrastructure.

Key responsibilities in this role are:
- Engage with ML researchers to optimize end-to-end performance of large-scale distributed ML workloads
- Analyze workload metrics to identify sources of inefficiencies and work with users to understand and optimize ML workloads
- Conduct workload analysis based on benchmarking key workloads on deployed systems
- Improve large-scale training resiliency by optimizing applications and frameworks for improved recovery from failures and preemptions
- Influence architecture design, development, and operations of next-generation ML accelerator systems based on workload insights

Qualifications:
- Experience working with large-scale parallel and distributed accelerator-based systems
- Experience optimizing the performance of AI workloads at scale
- Experience developing code in one or more training frameworks (such as PyTorch, TensorFlow, or JAX)
- Strong communicator with ability to analyze complex and ambiguous problems
- Programming and software design skills (proficiency in C/C++ and/or Python)
- Experience working in a high-level collaborative environment and promoting a teamwork mentality
- Bachelor's degree in Computer Science and 7 years of work experience
- Deep understanding of computer systems and the interactions between HW and SW
- Experience in performance analysis and optimization on cloud accelerators
- Advanced degree in CS
Required Experience:
Staff IC