As an engineer on the ML Compute team, your work will include:
- Drive large-scale pre-training initiatives to support cutting-edge foundation models, focusing on resiliency, efficiency, scalability, and resource optimization.
- Enhance distributed training techniques for foundation models.
- Research and implement new patterns and technologies to improve system performance, maintainability, and design.
- Optimize execution and performance of workloads built with JAX, PyTorch, XLA, and CUDA on large distributed systems.
- Leverage high-performance networking technologies such as NCCL for GPU collectives and TPU interconnect (ICI/Fabric) for large-scale distributed training.
- Architect a robust MLOps platform to streamline and automate pre-training operations.
- Operationalize large-scale ML workloads on Kubernetes, ensuring distributed training jobs are robust, efficient, and fault-tolerant.
- Lead complex technical projects, defining requirements and tracking progress with team members.
- Collaborate with cross-functional engineers to solve large-scale ML training challenges.
- Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
- Cultivate a team centered on collaboration, technical excellence, and innovation.
- Bachelor's degree in Computer Science, Engineering, or a related field
- 6 years of hands-on experience building scalable backend systems for training and evaluating machine learning models
- Proficient in relevant programming languages such as Python or Go
- Strong expertise in distributed systems reliability and scalability, containerization, and cloud platforms
- Proficient in cloud computing infrastructure and tools: Kubernetes, Ray, PySpark
- Ability to clearly and concisely communicate technical and architectural problems while working with partners to iteratively find solutions
- Advanced degree in Computer Science, Engineering, or a related field
- Proficient in working with and debugging accelerators such as GPUs, TPUs, and AWS Trainium
- Proficient in ML training and deployment frameworks such as JAX, TensorFlow, PyTorch, TensorRT, and vLLM
Required Experience:
Staff IC