As a staff engineer on the ML Compute team, your work will include:
- Leading the development of the infrastructure to run large-scale workloads on the cloud, such as Apache Spark, Ray, and distributed training.
- Optimizing platform efficiency and throughput by improving resource management capabilities with schedulers like Apache YuniKorn and Kueue.
- Integrating new features from core distributed computing and ML frameworks into the platform, offering them to production users, and providing support.
- Enhancing the platform's scalability, performance, and observability through improved monitoring and logging.
- Driving the architectural evolution of the platform by adopting modern cloud-native technologies to improve system performance, efficiency, and scalability.
- Reducing DevOps effort by automating and streamlining operational processes.
- Mentoring engineers in your areas of expertise, fostering skill growth and knowledge sharing.
- Bachelor's degree in Computer Science, Engineering, or a related field
- 6 years of hands-on experience in building scalable backend systems for training and evaluation of machine learning models
- Proficient in relevant programming languages like Python or Go
- Strong expertise in distributed systems reliability and scalability, containerization, and cloud platforms
- Proficient in cloud computing infrastructure and tools: Kubernetes, Ray, PySpark
- Ability to clearly and concisely communicate technical and architectural problems while working with partners to iteratively find solutions
- Advanced degree in Computer Science, Engineering, or a related field.
- Hands-on experience with cloud-native resource management and scheduling tools like Apache YuniKorn.
- Experience with advanced architecture for distributed data processing and ML workloads.
- Proficient in working with and debugging accelerators such as GPUs, TPUs, and AWS Trainium.
Required Experience:
Staff IC