Domain Lead - Site Reliability Management (REF4372N)

Deutsche Telekom IT Solutions

Posted on: 09-07-2025

1 Vacancy

Job Location

Budapest - Hungary

Monthly Salary

Not Disclosed

Job Description

The Domain Lead - Site Reliability Management is a senior leadership role responsible for the end-to-end reliability, resilience, and operational excellence of all IT systems within T-Systems. This executive will lead a distributed team of 10 Site Reliability Engineers embedded throughout the company, setting the strategic direction for reliability engineering and ensuring the stability of critical business services, operating and developing our entire internal IT landscape. The role is pivotal in driving a culture of continuous improvement, proactive risk management, and blameless learning throughout the IT organization, bringing new technology and smart solutions to the forefront of the company's future.

The purpose of the role is to:

Serve as the organization's chief stability and reliability authority, accountable for the availability, performance, and recoverability of all IT services.
Lead the design and execution of a comprehensive reliability strategy aligned with business objectives and regulatory requirements.
Foster a company-wide culture of resilience, incident prevention, and operational transparency.

Key Responsibilities

Strategic Leadership: Define and champion the company's reliability vision, policies, and maturity roadmap. Set and monitor organizational SLOs, SLIs, and error budgets.
Team Management: Direct and mentor a distributed team of SRMs, ensuring consistent standards, knowledge sharing, and professional growth across domains.
Reliability Governance: Oversee domain-wide stability programs, coordinate cross-functional reliability initiatives, and ensure alignment with business impact priorities.
Incident Command: Act as the executive escalation point during major incidents, ensuring effective incident response, root cause analysis, and implementation of systemic fixes.
Observability & Monitoring: Ensure comprehensive observability across all platforms, driving adoption of modern monitoring tools and practices to enable proactive detection and resolution.
Infrastructure & Deployment: Oversee the reliability of CI/CD pipelines, infrastructure-as-code practices, and deployment strategies (e.g. canary releases, blue-green deployments).
Resilience Engineering: Lead organization-wide initiatives in chaos engineering, failure testing, and capacity planning to minimize blast radius and prevent outages.
Change Management: Guide risk assessment and approval of major releases and configuration changes, potentially replacing legacy Change Challenger models.
Stakeholder Collaboration: Partner with engineering, product, and business leaders to align reliability goals, communicate risk, and drive adoption of best practices.
Culture & Learning: Promote a blameless postmortem culture, facilitate reliability workshops, and ensure continuous learning and improvement.

Qualifications:

Key Qualifications:

Proven executive experience in SRE, IT operations, or large-scale infrastructure leadership within complex distributed environments.
Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g. AWS, Azure, GCP).
Demonstrated success in leading cross-functional teams, driving organization-wide stability programs, and managing high-stakes incidents.
Strong familiarity with modern observability tools (Prometheus, Grafana, ELK, Datadog) and deployment frameworks (Kubernetes, Terraform, Ansible).
Exceptional communication skills, with the ability to influence senior stakeholders and coach both technical and non-technical teams.
Experience with ITIL, DevOps, and structured Change, Incident, and Problem Management frameworks.

Success Metrics:

Reduction in critical incidents, IBIs, and Mean Time to Repair (MTTR).
Measurable improvements in observability, monitoring coverage, and SLO adherence.
Implementation and tracking of preventive actions and systemic fixes.
Organization-wide visibility and mitigation of stability risks.
Delivery and execution of a reliability roadmap with clear progress metrics.

Core Knowledge Areas:

SRE principles (error budgets, toil reduction, SLOs/SLIs)
Incident lifecycle and blameless postmortems
Observability and monitoring (metrics, logging, alerting)
Infrastructure as code, CI/CD, deployment best practices
Chaos engineering, load and failure testing
Cloud and hybrid system design, geo-redundancy
Governance, communication, and cross-domain collaboration

Additional Information:

* Please note that remote work is only possible within Hungary due to European taxation regulations.

Remote Work:

Yes

Employment Type:

Full-time
