SRE Manager
Department:
Job Summary
WHAT YOU'LL DO:
- Lead and manage production and non-production support, ensuring high availability and system reliability.
- Drive SRE best practices, including incident management, root cause analysis, and continuous improvement.
- Assume ownership of major incidents and coordinate efforts to ensure quick resolution of impacting events.
- Collaborate with SRE team members on the design and development of observability practices such as dashboarding, logging, metrics, and tracing, in order to diagnose and troubleshoot issues proactively.
- Collaborate with SRE team members to define Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems, and monitor and maintain the uptime of these systems in line with the defined SLOs and SLAs.
- Identify and remove blockers, escalate appropriately, and maintain continuous momentum in troubleshooting efforts.
- Ensure adherence to established incident management processes and protocols.
- Contribute to the improvement of incident response runbooks and documentation.
- Own internal and external communications during major incidents.
- Translate technical details into business-impact language (scope, severity, risk, ETA, confidence level).
- Maintain clear and continuous communication with stakeholders during incidents, providing timely updates.
- Ensure safe execution of mitigations, rollbacks, feature flags, and failovers.
- Lead post-incident review meetings with stakeholders to confirm event details and assign problem investigators.
- Track and report on incident metrics, identifying patterns and areas for systemic improvement.
- Support Change Managers and/or Problem Managers as required in the performance of their responsibilities.
Qualifications :
WHAT YOU'VE DONE:
- Bachelor's or master's degree and/or equivalent experience relevant to the functional area.
- 12 years of experience in SRE/DevOps.
- 5 years of working experience as a Site Reliability Engineer.
- Experience managing critical incidents in a 24/7 production environment.
- Experience with ServiceNow ITSM and on-call incident coordination via PagerDuty/Zenduty (or comparable ITSM/on-call tools).
Knowledge, Skills, Abilities & Behaviours
- Understanding of a wide breadth of technical concepts across SRE practices.
- Background in cloud-based systems and SRE practices is a must.
- Experience with at least one observability platform, such as New Relic or Datadog, preferred.
- Ability to use AI tools to synthesize communications, reports, and troubleshooting leads.
- Certification in AWS, ITIL, or related frameworks preferred.
- Experience in SaaS or technology product companies preferred.
- Strong leadership and decision-making skills under pressure.
- Excellent verbal and written communication skills for both technical and non-technical audiences.
- Ability to manage multiple priorities and deadlines in high-stakes situations.
- Strong analytical skills to drive root cause analysis and trend identification.
- Familiarity with modern monitoring and incident management tools.
- Demonstrated ability to build consensus across diverse teams.
- Effective at maintaining calm and focus during critical situations.
- Knowledge of cloud infrastructure (e.g., AWS, Azure) and application architecture.
- Proven track record of improving incident management processes.
- Attention to detail in documentation and follow-through.
- Adept at facilitating collaboration across remote and global teams.
- Proactive in identifying operational risks and implementing preventive measures.
- Committed to continuous learning and process improvement.
- Ethical, dependable, and resilient in challenging scenarios.
Additional Information :
Day off on the 3rd Friday of every month (one long weekend each month)
Monthly Wellness Reimbursement Program to promote health and well-being
Monthly Office Commutation Reimbursement Program
Paid paternity and maternity leave
Remote Work :
No
Employment Type :
Full-time
About Company
Forbes Advisor is a new initiative for consumers under the Forbes Marketplace umbrella that provides journalist- and expert-written insights, news and reviews on all things personal finance, health, business, and everyday life decisions. We do this by providing consumers with the kno ...