Description:
We are seeking an experienced Lead Site Reliability Engineer (SRE) to manage and guide our team. The ideal candidate will have a strong foundation in SRE, DevOps, or infrastructure engineering, with leadership skills and the ability to drive team success in a fast-paced, dynamic environment. This role involves overseeing the team's execution, risk management, and strategic initiatives while fostering a collaborative and innovative culture.
Key Responsibilities:
Team Leadership and Management:
- Lead, mentor, and develop a team of SREs, fostering a culture of collaboration and continuous improvement
- Set clear goals and expectations for the team, ensuring alignment with business objectives
- Facilitate regular team meetings and one-on-one sessions to support individual growth and team cohesion
Execution and Delivery:
- Oversee the delivery of major themes of work, ensuring high-quality execution and timely completion
- Guide the team in estimating delivery timelines and managing workloads effectively
- Provide expert guidance in debugging and systems design, encouraging innovative solutions and trade-off analysis
Risk Management:
- Assess the cross-impact of team deliverables and ensure proactive communication of potential risks
- Support the team in identifying technical limitations and suggesting remediation strategies
Strategic Vision and Forward Thinking:
- Develop and implement strategic plans for building robust systems with strong contracts, anticipating future changes
- Encourage the team to propose alternative requirements and solutions that better meet organizational needs
- Set and prioritize the team's strategic book of work in line with the goals of the business
Communication and Stakeholder Engagement:
- Communicate effectively with stakeholders, providing updates on progress and raising risks that could impact delivery
- Ensure the team is aligned with the business vision and understands the importance of their contributions to the product
Qualifications:
- Experience directly leading, or functioning as a lead of, technical teams with a focus on SRE, DevOps, or infrastructure engineering
- Proficiency in programming languages (Python preferred) and distributed systems (Kubernetes, Kafka, Cassandra, etc.)
- Experience with setting up and using SLOs to track system health and performance
- Excellent problem-solving skills and creativity in debugging complex issues
- Deep understanding of cloud fundamentals and infrastructure management
- Exceptional communication skills, with the ability to articulate technical problems and solutions to diverse audiences
- A strategic mindset with a keen interest in automation and learning
- A thorough understanding of the full stack of the system
An example of a task/problem to be tackled is below. Does leading a team that solves system-wide problems excite you?
Our system has been working properly for the past few days in our UAT environment. We deployed a new version of core infrastructure that was tested in dev; we found it to be working & then approved it for UAT release. Suddenly, one of our services is not starting & our product or QA team cannot test changes in this environment. We receive a ping/bug report that provides high-level information about what is happening, what the user would like to happen & perhaps information about what they expect to happen. We ask you to take a look at the issue. Resolving this involves:
- Asking & communicating with the user to fully understand what the issue is
- Understanding where in the stack to begin debugging
- Constantly questioning your assumptions about the way the system should work
- Being able to ask the right questions of your peers & team to triage an issue
- Providing updates to stakeholders that are counting on you to identify or fix the problem
- Using your technical skill set to identify/reproduce the issue
- Communicating what you have found to the team so that we can best resolve the issue