What Are We Looking For
- As a Staff Incident and Escalation Manager, you will lead the response to high-severity production incidents that impact customers and mission-critical services.
- Operating in a fast-paced, cloud-native environment with globally distributed teams, you will act as the central point of coordination during major incidents, ensure timely resolution, maintain clear communication, and drive long-term process improvements.
- This is a high-impact role with visibility across the organization and a direct influence on customer trust and platform reliability.
What Will You Do
- Serve as the primary incident commander for high-severity incidents across our production environment.
- Coordinate real-time troubleshooting efforts across globally distributed engineering and operations teams.
- Provide timely, accurate updates to stakeholders, customers (as needed), and executive leadership.
- Determine appropriate escalation paths and timing to drive fast resolution.
- Collaborate across Engineering, SRE, Product, and Support functions to ensure rapid alignment and resource mobilization.
- Facilitate post-incident reviews for high-severity events, ensuring root cause analysis and comprehensive documentation.
- Promote a blameless culture and drive learning-focused retrospectives.
- Ensure follow-up action items are clearly defined, assigned, and completed on schedule.
- Maintain visibility into resolution progress and escalate blockers as needed.
- Enhance incident response practices through process development, tooling improvements, and knowledge sharing.
- Partner with global SRE and Engineering teams to improve observability, monitoring, alerting, and runbook quality.
- Participate in a rotating on-call schedule as a designated incident commander for major incidents.
- Be available during your on-call shift to lead incident calls, coordinate cross-functional teams, and drive resolution.
- Ensure smooth handoffs between on-call rotations and maintain accurate status documentation.
- On-call participation is shared equitably across the team and supported with clear escalation protocols and backup coverage.
What Skills and Experience Will You Need
- 5 years of experience in incident management, SRE, or operations roles within SaaS or cloud-native environments.
- Demonstrated ability to lead complex, high-severity incident response efforts across global teams.
- Strong communication and leadership skills with the ability to stay composed under pressure.
- Experience with observability and incident tooling (e.g., PagerDuty, Opsgenie, Datadog, Splunk, Jira).
- Deep understanding of service reliability principles, escalation strategies, and root cause analysis methodologies.
- Comfortable working across time zones and in a fast-paced, evolving environment.
Why Us
You will be joining a cutting-edge company where you will tackle extraordinary challenges and work with the very best in the industry.
- Health Insurance
- Industry-leading gender-neutral parental leave
- Paid Company Holidays
- Paid Sick Time
- Employee stock purchase program
- Employee assistance program
- Gym membership
- Cell phone/wifi allowance
- Numerous company-sponsored events including regular happy hours and team-building events
Required Experience:
Manager