The Agile Infrastructure & Reliability Lead is a senior IT expert accountable for the overall stability, reliability, and operational excellence of a specific application domain or service area. The Agile Infrastructure & Reliability Lead acts as a technical leader and stability expert, driving proactive measures to prevent outages, minimize service degradation, and foster a culture of continuous improvement in system stability.
Key Responsibilities
- Domain Ownership: Own and oversee the reliability maturity of all systems and services within their assigned area or domain. Segment services based on business impact (e.g. IBI relevance) and prioritize stability measures accordingly.
- Stability Strategy: Define and execute a domain-specific stability improvement roadmap aligned with company-wide resilience goals. Drive blast radius reduction initiatives and work toward minimizing changes leading to incidents (CLTI).
- Incident Prevention: Identify and eliminate single points of failure, systemic risks, and architectural weaknesses in collaboration with development architects and infrastructure teams. Ensure architecture diagrams reflect the actual deployment.
- Incident Management Support: Act as a lead technical expert during major incidents affecting their domain, supporting root cause analysis and follow-up remediation plans.
- Observability & Monitoring: Ensure sufficient observability is in place (metrics, logging, alerts) and drive the adoption of SLOs, SLIs, and error budgets. Ensure monitoring and alerting cover business metrics as well as technical ones, so that issues are detected proactively before users are affected.
- Collaboration & Governance: Work closely with engineering leads, product owners, and companywide stability programs to align standards, tools, and reliability KPIs.
- Postmortem Culture: Drive blameless postmortems, lessons learned, and systematic fixes that prevent recurrence of issues.
- Capacity Planning: Collaborate with capacity and performance teams to anticipate scaling needs and mitigate risks from traffic or load surges.
- Change Impact Evaluation: Participate in change advisory processes to assess the risk of releases and configuration changes within their domain. This may replace the current Change Challenger model.
- Knowledge Sharing & Advocacy: Act as a domain coach by sharing best practices, reliability principles, and learnings across teams through workshops, documentation, and mentoring.
- Growth & Development Enablement: Guide the development path for engineers within the domain by helping them understand progression frameworks, skill expectations, and opportunities to grow their reliability expertise.
Qualifications :
YOU WILL SUCCEED IF YOU:
Have the following experience:
- Strong experience in software engineering, system administration, or infrastructure roles, with a track record of improving service reliability.
- Deep technical understanding of stability-related topics and concepts.
- Familiarity with reliability frameworks (SRE principles, ITIL, DevOps practices).
- Proficiency with observability tools (e.g. Prometheus, Grafana, ELK).
- Experience leading or contributing to incident management and root cause analysis.
- Excellent communication skills and ability to align cross-functional teams around stability goals.
- Experience working within structured Change, Incident, and Problem Management frameworks.
Speak English at least at the B2 level. Speaking German is an advantage.
Additional Information :
Benefits
We believe in a balance between work and personal life. An attractive and extensive work-life balance portfolio guarantees lasting motivation for employees and thus a better quality of life, promotes physical and mental well-being, and contributes to a positive work environment. The aim is to provide more freedom in reconciling work, career growth, private life, and individual lifestyle. We therefore offer our employees over 25 different benefits to improve their personal and professional life in these areas:
- Financial benefits
- Benefits with focus on learning and development
- Benefits with focus on health and sport
- Benefits with focus on family and work life balance
- Other benefits
For more information about our benefits, click Benefits.
Salary
Final salary is negotiable.
We offer a base salary depending on the candidate's seniority level and previous experience. In addition to the base salary, we provide a variable component and other financial benefits. The base salary will not be lower than 2300 gross.
Additional information
* Please note that remote working is only available within Slovakia due to European taxation regulations.
Remote Work :
No
Employment Type :
Full-time