Principal Machine Learning Engineer Reliability

Roblox

Job Location:

San Mateo, CA - USA

Monthly Salary: Not Disclosed
Posted on: 13 hours ago
Vacancies: 1 Vacancy

Job Summary

Why Reliability

Roblox serves over 100 million people every day across a platform that is constantly evolving, and behind every experience is infrastructure that has to work every time at massive scale. The Reliability team at Roblox operates across the depth and breadth of the Roblox stack. Availability of the platform is a key company goal. We are hiring the first Principal Machine Learning Engineer on our team.

As a Principal Machine Learning Engineer within Reliability, you will set the 3-5 year technical strategy and architectural blueprint for how machine learning systems and practices can be leveraged to improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for leveraging massive data across logs, traces, metrics, and production changes to proactively detect issues before they become real problems (MTTD) and/or reduce the time to resolve incidents (MTTR). You will have the opportunity to collaborate cross-functionally with similar teams at Roblox to define best practices and software.

You will:

  • Define and Own the Technical Vision: Define and lead the multi-year technical vision, architectural strategy, and execution for machine learning solutions in Content Safety, ensuring these systems proactively and effectively detect and mitigate violative content at massive scale.
  • Strategic Stakeholder Partnership: Collaborate with executive-level Product, Data Science, Policy, and Operations leaders to define and prioritize the strategic machine learning roadmap, influencing product strategy and demonstrating the impact of ML on user trust and safety outcomes.
  • Lead Innovation: Oversee the adoption and safe deployment of innovative machine learning techniques (e.g., transfer learning, self-supervised learning, quantization, LoRA, distillation).
  • Drive End-to-End Product Development: You will not just model; you will build. You will work cross-functionally to construct datasets from scratch where none exist, build auto-labeling pipelines, and ship solutions to solve novel technical problems.
  • Ship Code, Not Just Models: Expect to spend roughly 30-40% of your time on backend and integration work. You will be responsible for integrating your work into the production stack, leveraging modern AI coding tools (e.g., Cursor) to accelerate velocity and handle infrastructure complexity.

You have:

  • 8 years of experience designing, developing, and operating large-scale, high-impact machine learning systems in a production environment.
  • A proven track record of successfully setting the long-term technical direction for an entire ML domain, demonstrating the ability to take ambiguous problems from concept to scaled production impact.
  • Deep expertise in advanced ML architectures and techniques, including Computer Vision (CV) and/or Vision-Language Models (VLMs).
  • Expertise in architecting scalable, real-time ML inference services and robust data pipelines.
  • Demonstrated success in leading and resolving high-stakes, cross-functional conflicts and technical disagreements, with an ability to build consensus among diverse stakeholders.
  • Exceptional product sense and strategic planning ability: able to translate platform safety requirements into an achievable iterative technical roadmap.

You are:

  • A Visionary Architect: Capable of synthesizing complex business and safety goals into a clear, compelling, and actionable technical strategy.
  • A Pragmatic Builder: You are scrappy and impact-oriented. You view undefined data and messy systems as opportunities to build structure rather than blockers to progress.
  • Comfortable with Ambiguity: You thrive in undefined or open-ended problem spaces, providing structure, clarity, and decisive direction to your teams.
  • An Inspiring Leader: Passionate about developing the next generation of technical leaders, managers, and engineers.
  • An Executive Communicator: Highly effective at communicating complex technical concepts to both engineering teams and non-technical executive leadership.
  • Committed to Ethical AI: Dedicated to building ML systems that are fair and transparent and that operate with the utmost responsibility toward user safety and platform civility.

Required Experience:

Staff IC

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting

About Company

Roblox is the ultimate virtual universe that lets you create, share experiences with friends, and be anything you can imagine. Join millions of people and discover an infinite variety of immersive experiences created by a global community!
