Domain Lead - Site Reliability Management (REF4372N)

Work set-up: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

  • Proven executive experience in Site Reliability Engineering (SRE), IT operations, or large-scale infrastructure leadership.
  • Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g., AWS, Azure, GCP).
  • Experience leading cross-functional teams and managing organization-wide stability programs.
  • Strong familiarity with modern observability tools and deployment frameworks such as Kubernetes and Terraform.

Key responsibilities:

  • Define and lead the company's reliability vision, policies, and strategies.
  • Manage and mentor a distributed team of Site Reliability Engineers.
  • Oversee stability programs, incident response, and systemic fixes to ensure IT service reliability.

Deutsche Telekom IT Solutions HU XLarge https://www.deutschetelekomitsolutions.hu/
5001 - 10000 Employees

Job description

Company Description

The largest ICT employer in Hungary, Deutsche Telekom IT Solutions (formerly IT-Services Hungary, ITSH) is a subsidiary of the Deutsche Telekom Group. Established in 2006, the company has more than 5000 employees and provides a wide portfolio of IT and telecommunications services. ITSH was awarded the Best in Educational Cooperation prize by HIPA in 2019, was named the Most Ethical Multinational Company in 2019, and was recognized as one of the most attractive workplaces in PwC Hungary’s independent survey in 2021. The company continuously develops its four sites in Budapest, Debrecen, Pécs and Szeged and is looking for skilled IT professionals to join its team.

Job Description

The Domain Lead - Site Reliability Management is a senior leadership role responsible for the end-to-end reliability, resilience, and operational excellence of all IT systems within T-Systems. This executive will lead a distributed team of 10 Site Reliability Engineers embedded throughout the company, setting the strategic direction for reliability engineering and ensuring the stability of the critical business services that operate and develop our entire internal IT landscape. The role is pivotal in driving a culture of continuous improvement, proactive risk management, and blameless learning throughout the IT organization, bringing new technology and smart solutions to the forefront of the company's future.

The purpose of the role is to:

  • Serve as the organization's chief stability and reliability authority, accountable for the availability, performance, and recoverability of all IT services.
  • Lead the design and execution of a comprehensive reliability strategy, aligning with business objectives and regulatory requirements.
  • Foster a company-wide culture of resilience, incident prevention, and operational transparency.

Key Responsibilities

  • Strategic Leadership: Define and champion the company’s reliability vision, policies, and maturity roadmap. Set and monitor organizational SLOs, SLIs, and error budgets.
  • Team Management: Direct and mentor a distributed team of SRMs, ensuring consistent standards, knowledge sharing, and professional growth across domains.
  • Reliability Governance: Oversee domain-wide stability programs, coordinate cross-functional reliability initiatives, and ensure alignment with business impact priorities.
  • Incident Command: Act as the executive escalation point during major incidents, ensuring effective incident response, root cause analysis, and implementation of systemic fixes.
  • Observability & Monitoring: Ensure comprehensive observability across all platforms, driving adoption of modern monitoring tools and practices to enable proactive detection and resolution.
  • Infrastructure & Deployment: Oversee the reliability of CI/CD pipelines, infrastructure as code practices, and deployment strategies (e.g., canary releases, blue-green deployments).
  • Resilience Engineering: Lead organization-wide initiatives in chaos engineering, failure testing, and capacity planning to minimize blast radius and prevent outages.
  • Change Management: Guide risk assessment and approval of major releases and configuration changes, potentially replacing legacy Change Challenger models.
  • Stakeholder Collaboration: Partner with engineering, product, and business leaders to align reliability goals, communicate risk, and drive adoption of best practices.
  • Culture & Learning: Promote a blameless postmortem culture, facilitate reliability workshops, and ensure continuous learning and improvement.

Qualifications

Key Qualifications:

  • Proven executive experience in SRE, IT operations, or large-scale infrastructure leadership within complex, distributed environments.
  • Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g., AWS, Azure, GCP).
  • Demonstrated success in leading cross-functional teams, driving organization-wide stability programs, and managing high-stakes incidents.
  • Strong familiarity with modern observability tools (Prometheus, Grafana, ELK, Datadog) and deployment frameworks (Kubernetes, Terraform, Ansible).
  • Exceptional communication skills, with the ability to influence senior stakeholders and coach both technical and non-technical teams.
  • Experience with ITIL, DevOps, and structured Change, Incident, and Problem Management frameworks.

Success Metrics:

  • Reduction in critical incidents, IBIs, and Mean Time to Repair (MTTR).
  • Measurable improvements in observability, monitoring coverage, and SLO adherence.
  • Implementation and tracking of preventive actions and systemic fixes.
  • Organization-wide visibility and mitigation of stability risks.
  • Delivery and execution of a reliability roadmap, with clear progress metrics.

Core Knowledge Areas:

  • SRE principles (error budgets, toil reduction, SLOs/SLIs)
  • Incident lifecycle and blameless postmortems
  • Observability and monitoring (metrics, logging, alerting)
  • Infrastructure as code, CI/CD, deployment best practices
  • Chaos engineering, load and failure testing
  • Cloud and hybrid system design, geo-redundancy
  • Governance, communication, and cross-domain collaboration

Additional Information

* Please note that remote work is available only within Hungary due to European taxation regulations.

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Team Management
  • Communication
