Agile Infrastructure & Reliability Lead

Work set-up: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

  • Extensive experience in software engineering, system administration, or infrastructure roles.
  • Deep understanding of system stability, reliability, and related concepts.
  • Proficiency with observability tools such as Prometheus, Grafana, and ELK.
  • Experience in incident management, root cause analysis, and working within structured change management frameworks.

Key responsibilities:

  • Own and oversee the reliability of assigned systems and services.
  • Define and implement stability improvement strategies and roadmaps.
  • Support incident management and lead root cause analysis during major outages.
  • Promote best practices in observability, monitoring, and post-incident learning.

Deutsche Telekom IT Solutions Slovakia: https://www.deutschetelekomitsolutions.sk/
1001 - 5000 Employees

Job description

Company Description

Our brand, Deutsche Telekom IT Solutions Slovakia, entered the life of the Košice region in 2006 under the name T-Systems Slovakia. It has been closely linked with the region ever since and became one of the founding members of Košice IT Valley. We have grown from scratch into the second largest employer in the eastern part of the country, with more than 3900 employees. Our goal is to proactively find new ways to improve and to keep transforming into a company that provides innovative information and communication technology services.

Job Description

The Agile Infrastructure & Reliability Lead is a senior IT expert accountable for the overall stability, reliability, and operational excellence of a specific application, domain, or service area. The Lead acts as a technical leader and stability expert, driving proactive measures to prevent outages, minimize service degradation, and foster a culture of continuous improvement in system stability.

 

Key Responsibilities

  • Domain Ownership: Own and oversee the reliability maturity of all systems and services within the assigned area or domain. Segment services based on business impact (e.g., IBI relevance), and prioritize stability measures accordingly. The scope includes software architecture and integration.
  • Stability Strategy: Define and execute a domain-specific stability improvement roadmap, aligned with company-wide resilience goals. Drive blast radius reduction initiatives and work toward minimizing changes leading to incidents (CLTI).
  • Incident Prevention: Identify and eliminate single points of failure, systemic risks, and architectural weaknesses in collaboration with development, architecture, and infrastructure teams. Ensure architecture diagrams reflect the actual deployment.
  • Incident Management Support: Act as a lead technical expert during major incidents affecting their domain, supporting root cause analysis and follow-up remediation plans.
  • Observability & Monitoring: Ensure sufficient observability is in place (metrics, logging, alerts) and drive the adoption of SLOs, SLIs, and error budgets. Make monitoring and alerting comprehensive enough, including business metrics, to detect issues before users are affected.
  • Collaboration & Governance: Work closely with engineering leads, product owners, and company-wide stability programs to align standards, tools, and reliability KPIs.
  • Postmortem Culture: Drive blameless postmortems, lessons learned, and systematic fixes that prevent recurrence of issues.
  • Capacity Planning: Collaborate with capacity and performance teams to anticipate scaling needs and mitigate risks from traffic or load surges.
  • Change Impact Evaluation: Participate in change advisory processes to assess the risk of releases and configuration changes within the domain. This may replace the current Change Challenger model.
  • Knowledge Sharing & Advocacy: Act as a domain coach by sharing best practices, reliability principles, and learnings across teams through workshops, documentation, and mentoring.
  • Growth & Development Enablement: Guide the development path for engineers within the domain by helping them understand progression frameworks, skill expectations, and opportunities to grow their reliability expertise. 

Qualifications

YOU WILL SUCCEED IF YOU:

Have the following experience:

  • Strong experience in software engineering, system administration, or infrastructure roles with a track record of improving service reliability.
  • Deep technical understanding of stability-related topics and concepts.
  • Familiarity with reliability frameworks (SRE principles, ITIL, DevOps practices).
  • Proficiency with observability tools (e.g., Prometheus, Grafana, ELK).
  • Experience leading or contributing to incident management and root cause analysis.
  • Excellent communication skills and ability to align cross-functional teams around stability goals.
  • Experience working within structured Change, Incident, and Problem Management frameworks.

Speak English at B2 level or higher. German is an advantage.

Additional Information

Benefits

We believe in a balance between work and personal life. An attractive and extensive work-life balance portfolio supports lasting motivation and a better quality of life for employees, promotes physical and mental well-being, and contributes to a positive work environment. The aim is to give employees more freedom in reconciling work, career growth, private life, and individual lifestyle. We therefore offer our employees more than 25 benefits to improve their personal and professional lives in these areas:

  • Financial benefits
  • Benefits with focus on learning and development
  • Benefits with focus on health and sport
  • Benefits with focus on family and work-life balance
  • Other benefits

For more information about our benefits, see our Benefits page.

Salary

Final salary is negotiable.

We offer a base salary that depends on the candidate's seniority level and previous experience. In addition to the base salary, we provide a variable component and other financial benefits. The base salary will not be lower than 2300 € gross.

Additional information

* Please note that remote work is only possible within Slovakia due to European taxation regulations.

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Collaboration
  • Communication
  • Problem Solving
