Are you a seasoned engineer with a passion for ML reliability? We're looking for exceptional Software Engineers to join the Reliability team at Roblox. In this pivotal role, you will drive the evolution of our ML systems, ensuring they meet the highest standards of performance, reliability, and efficiency. You'll collaborate with cross-functional teams to build robust ML infrastructure that supports our growth. If you have a track record of solving complex technical challenges, we want to hear from you. Join us in shaping the future of our platform and delivering unparalleled value to our users.
At Roblox, our vision is to achieve 1 billion daily active users. We believe this engineer will be instrumental in driving us towards that ambitious goal.
You Will:
- Build, automate, and standardize processes to create a golden path of ML tooling and platform support that powers the Roblox ML ecosystem.
- Create tooling that provides production guardrails for developing and delivering ML training and inference services to production.
- Create performance-monitoring and observability services for understanding ML capacity issues and platform degradations.
You Have:
- Experience: You have a BS degree (or equivalent professional experience) in Computer Science or a related engineering field, with at least 6 years of experience, including at least 2 years in SRE or Software Engineering.
- Deep experience running Kubernetes clusters in production environments at large scale, both on-premises and hosted.
- Hands-on experience with observability, maintenance, and upgrades of large-scale Kubernetes clusters.
- Experience running ML training and inference workloads on Kubernetes, supporting MLOps frameworks like Kubeflow, and working with GPUs.
- Experience working with popular machine learning frameworks such as TensorFlow or PyTorch.
- Passion for systems: You have experience and good habits around building software and tools and getting them adopted.
You Are:
- A Partner: You know that the best tools integrate broadly with the tooling ecosystem. You approach partners and processes with curiosity and seek to understand a problem deeply before you start coding.
- A Coder: You have experience writing code in common programming languages (Python, Go, C#).
- Self-organized: You're excited about getting in front of complex problems, organizing your work by any means possible, overcoming emergent issues, and contributing to long-running projects as part of the team.
- Problem Solver: You ask the right questions to solve issues within your expertise, and you use data to test your theories.
- Planner: You have experience with large project lifecycles. You have experience working in sprints, breaking down complex tasks into milestones, and reporting status to keep project scheduling accurate.
Required Experience:
Senior IC