Let's start with the role
As a Site Reliability Operations Manager, you will lead the operational reliability layer of our production environment, ensuring 24/7 service stability across networks, applications, and infrastructure.
You will own the performance and evolution of our Site Reliability Operations function: managing shift-based teams, strengthening incident response practices, and driving measurable improvements in uptime, response time, and operational maturity. You will also directly handle and oversee the end-to-end incident flow.
You will be responsible for ensuring that incidents are properly triaged, escalated, coordinated, and resolved, while continuously improving our incident management processes.
This role sits at the intersection of Infrastructure, Platform, Security, and Product, ensuring that reliability is not reactive but engineered and continuously improved.
Reliability at scale in a high-traffic, real-time gaming environment demands precision, discipline, and strong leadership. This role is critical to that mission.
As a Site Reliability Operations Manager, you will:
- Lead and develop the Site Reliability Operations team, ensuring high performance across 24/7 shift coverage.
- Own incident management processes, including severity classification, escalation paths, communication standards, and post-incident reviews.
- Ensure proactive monitoring of production systems with meaningful alerting that minimizes noise and maximizes actionability.
- Track and improve key operational metrics such as MTTA, MTTR, uptime, and SLA adherence.
- Establish and refine standard operating procedures for monitoring, escalation, and vendor coordination.
- Drive structured communication during incidents, ensuring clear updates to technical and business stakeholders.
- Collaborate closely with SRE, Infrastructure, Security, and Engineering teams to eliminate recurring incidents through root cause analysis and systemic improvements.
- Oversee relationships with external vendors and providers during both routine operations and major outages.
- Promote a culture of operational excellence, accountability, and continuous improvement.
- Participate in capacity planning and operational readiness reviews for new launches and major changes.
What you will bring
- Proven experience leading technical operations or NOC/SRE Operations teams in high-availability environments.
- Strong understanding of production monitoring, alerting systems, and incident management frameworks.
- Solid knowledge of networking fundamentals (TCP/IP), infrastructure components, and cloud or hybrid environments.
- Experience working in 24/7 operational models with shift-based teams.
- Hands-on familiarity with ticketing systems and operational reporting.
- Ability to analyze operational data and translate it into improvement initiatives.
- Strong stakeholder communication skills, especially under pressure.
- Structured thinker with close attention to detail and strong execution discipline.
- Experience in gaming, fintech, e-commerce, or other real-time, high-scale digital environments is considered a strong plus.
Required Experience:
Manager