Principal Site Reliability Operations Engineer

Roblox

Job Location:

San Mateo, CA - USA

Monthly Salary: Not Disclosed
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

As a Senior Site Reliability Operations Engineer on the Reliability Team, you will manage production incidents and improve Roblox's incident processes while reporting to the Senior Operations Manager. You will maintain reliability service-level objectives, drive incidents tenaciously to resolution, and work with service teams towards appropriate action items during the incident postmortem process. If you are passionate about maintaining uptime in a complex distributed environment full of continuous change, you'll be right at home with our Reliability team. You will report to the Senior Manager, Reliability Response.

You Will:

  • Lead and manage production incidents.
  • Collaborate cross-functionally to troubleshoot and resolve sophisticated technical challenges.
  • Guide the implementation of incident management processes and procedures, ensuring fast and effective responses to minimize impact.
  • Continually monitor system health, performance, and capacity, proactively addressing potential issues.
  • Conduct comprehensive post-mortem analysis to ascertain the root cause of incidents and formulate corrective measures.
  • Contribute substantially to the design and enhancement of system architecture to boost reliability and performance.
  • Leverage coding skills to automate daily routine tasks and enhance system efficiency.
  • Serve in the Incident Manager On-Call rotation.
  • Mentor junior team members.

You Have:

  • At least 8 years of experience in a comparable role within a Site Reliability Team.
  • Advanced knowledge of systems and network infrastructure protocols.
  • Demonstrated ability in managing, troubleshooting, and resolving incidents in distributed environments.
  • Experience solving problems.
  • An ability to distill complex technical issues into clear and concise language.
  • Familiarity with at least one scripting or programming language to automate routine tasks (Python, Golang, or similar languages preferred).
  • Bachelor's degree or equivalent experience in Computer Science, Computer Engineering, or a similar technical field.

You Are:

  • A great communicator; you are able to explain complex systems clearly to stakeholders and fellow engineers.
  • Able to operate in potentially ambiguous circumstances during a production incident.
  • Familiar with the interactions of services in a distributed system.
  • Tenacious towards driving challenging production incidents to resolution.

Required Experience:

Staff IC

Key Skills

  • Kubernetes
  • FMEA
  • Continuous Improvement
  • Elasticsearch
  • Go
  • Root Cause Analysis
  • Maximo
  • CMMS
  • Maintenance
  • Mechanical Engineering
  • Manufacturing
  • Troubleshooting

About Company

Roblox is the ultimate virtual universe that lets you create, share experiences with friends, and be anything you can imagine. Join millions of people and discover an infinite variety of immersive experiences created by a global community!
