Domain Lead - Site Reliability Management (REF4372N)

Job Location

Budapest - Hungary

Monthly Salary

Not Disclosed

Vacancy

1 Vacancy

Job Description

The Domain Lead - Site Reliability Management is a senior leadership role responsible for the end-to-end reliability, resilience, and operational excellence of all IT systems within T-Systems. This executive will lead a distributed team of 10 Site Reliability Engineers embedded throughout the company, setting the strategic direction for reliability engineering, ensuring the stability of critical business services, and operating and developing our entire internal IT landscape. The role is pivotal in driving a culture of continuous improvement, proactive risk management, and blameless learning throughout the IT organization, bringing new technology and smart solutions to the forefront of the company's future.

The purpose of the role is to:

  • Serve as the organization's chief stability and reliability authority, accountable for the availability, performance, and recoverability of all IT services.
  • Lead the design and execution of a comprehensive reliability strategy aligned with business objectives and regulatory requirements.
  • Foster a company-wide culture of resilience, incident prevention, and operational transparency.

Key Responsibilities

  • Strategic Leadership: Define and champion the company's reliability vision, policies, and maturity roadmap. Set and monitor organizational SLOs, SLIs, and error budgets.
  • Team Management: Direct and mentor a distributed team of SRMs, ensuring consistent standards, knowledge sharing, and professional growth across domains.
  • Reliability Governance: Oversee domain-wide stability programs, coordinate cross-functional reliability initiatives, and ensure alignment with business impact priorities.
  • Incident Command: Act as the executive escalation point during major incidents, ensuring effective incident response, root cause analysis, and implementation of systemic fixes.
  • Observability & Monitoring: Ensure comprehensive observability across all platforms, driving adoption of modern monitoring tools and practices to enable proactive detection and resolution.
  • Infrastructure & Deployment: Oversee the reliability of CI/CD pipelines, infrastructure-as-code practices, and deployment strategies (e.g., canary releases, blue-green deployments).
  • Resilience Engineering: Lead organization-wide initiatives in chaos engineering, failure testing, and capacity planning to minimize blast radius and prevent outages.
  • Change Management: Guide risk assessment and approval of major releases and configuration changes, potentially replacing legacy Change Challenger models.
  • Stakeholder Collaboration: Partner with engineering, product, and business leaders to align reliability goals, communicate risk, and drive adoption of best practices.
  • Culture & Learning: Promote a blameless postmortem culture, facilitate reliability workshops, and ensure continuous learning and improvement.

Qualifications :

Key Qualifications:

  • Proven executive experience in SRE, IT operations, or large-scale infrastructure leadership within complex distributed environments.
  • Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g., AWS, Azure, GCP).
  • Demonstrated success in leading cross-functional teams, driving organization-wide stability programs, and managing high-stakes incidents.
  • Strong familiarity with modern observability tools (Prometheus, Grafana, ELK, Datadog) and deployment frameworks (Kubernetes, Terraform, Ansible).
  • Exceptional communication skills, with the ability to influence senior stakeholders and coach both technical and non-technical teams.
  • Experience with ITIL, DevOps, and structured Change, Incident, and Problem Management frameworks.

Success Metrics:

  • Reduction in critical incidents, IBIs, and Mean Time to Repair (MTTR).
  • Measurable improvements in observability, monitoring coverage, and SLO adherence.
  • Implementation and tracking of preventive actions and systemic fixes.
  • Organization-wide visibility and mitigation of stability risks.
  • Delivery and execution of a reliability roadmap with clear progress metrics.

 

Core Knowledge Areas:

  • SRE principles (error budgets, toil reduction, SLOs/SLIs)
  • Incident lifecycle and blameless postmortems
  • Observability and monitoring (metrics, logging, alerting)
  • Infrastructure as code, CI/CD, deployment best practices
  • Chaos engineering, load and failure testing
  • Cloud and hybrid system design, geo-redundancy
  • Governance, communication, and cross-domain collaboration


Additional Information :

* Please note that remote work is only available within Hungary due to European taxation regulations.


Remote Work :

Yes


Employment Type :

Full-time
