Key Facts

Remote from: Germany

Category: Operations Specialist

Fixed term

Expert & Leadership (>10 years)

English, German

Hard Skills

Operating System Development, Incident Management, Problem Management, Change Management, VMware Virtualization, Red Hat Enterprise Linux, Service Management, Atlassian Confluence, Continuous Monitoring, Observability

Other Skills

• Troubleshooting (Problem Solving)
• Collaboration

Requirements

5 to 10+ years in IT operations, service delivery or platform operations with demonstrated leadership in mission-critical environments
Proven experience implementing and leading Incident, Problem, Change and Release governance in production
Hands-on experience with VMware 8 virtualisation
Fluent English and German (C1 minimum in both)

Roles & Responsibilities

Providing T3 operational ownership for Compute & OS services: handling complex incidents, troubleshooting and RCA, and driving permanent fixes and preventive measures
Ensuring compute/OS readiness for releases and changes: monitoring/alerting coverage, performance baselines, hardening, patch strategy, rollback and recovery procedures, and runbooks
Executing and improving standard operational procedures through automation to reduce toil and improve MTTR and stability
Coordinating with Kubernetes, Data, Network and Storage SMEs to resolve cross-domain production issues

Job description

This is a remote position.

T3 Operations & Support Specialist — Compute & OS (PID9066)

Contract / Freelance
Full-time
Remote with travel readiness required (Germany)
Start: ASAP

About the role

We are working with a long-standing anchor client to source a T3 Operations & Support Specialist (Compute & OS) for a large-scale cloud-native platform programme supporting a major energy transmission operator in Germany. The platform is a service-oriented hybrid cloud environment providing application teams with self-service capabilities to develop, run and operate software products across private and public cloud infrastructure.

In this role you will provide Tier-3 operational ownership for Compute & Operating System services within Local Production (DE), handling complex incidents, deep troubleshooting and root cause analysis, and driving permanent fixes and preventive measures.

What you'll be doing

Providing T3 operational ownership for Compute & OS services: handling complex incidents, troubleshooting and RCA, and driving permanent fixes and preventive measures
Ensuring compute/OS readiness for releases and changes: monitoring/alerting coverage, performance baselines, hardening, patch strategy, rollback and recovery procedures, and runbooks
Executing and improving standard operational procedures through automation to reduce toil and improve MTTR and stability
Coordinating with Kubernetes, Data, Network and Storage SMEs to resolve cross-domain production issues
Validating deployment artefacts from an operations perspective and enforcing quality assurance measures
Monitoring system health, performance metrics and service availability across multi-tenant environments
Identifying, analysing and resolving incidents to minimise service disruption, and triggering RCA and corrective actions
Implementing monitoring and logging strategies to support audit and compliance requirements
Performing routine security scans and remediating identified vulnerabilities

Requirements

What you'll need

5 to 10+ years in IT operations, service delivery or platform operations with demonstrated leadership in mission-critical environments
Proven experience implementing and leading Incident, Problem, Change and Release governance in production
Hands-on experience with VMware 8 virtualisation
Operating Systems: Red Hat Enterprise Linux and Ubuntu
OS tooling: Red Hat Satellite, IPA (identity management), Certificate Server
ITSM/collaboration tooling: Jira Service Management, Jira, Confluence
Fundamental understanding of core operations processes (Incident, Change, Problem management, ITSM) and SRE concepts
Experience gathering operational insights from monitoring/observability including SLI/SLA/SLO management and tracking
Hands-on experience documenting procedures and enforcing clear runbooks and playbooks
Hands-on experience with monitoring and logging tools (e.g. Prometheus, Grafana, Datadog, Mimir, Loki)
Understanding of modern platform operations (Kubernetes/containers, automation, observability) sufficient to govern specialists
Fluent English and German (C1 minimum in both)

Desirable

Experience operating in regulated or high-availability industries (banking, telco, public sector, healthcare)
Experience with SRE practices (SLOs/SLIs, error budgets) and reliability management
Familiarity with enterprise DevOps toolchains (GitLab, JFrog Artifactory, Backstage, Harness)
GitOps and IaC awareness (Terraform/OpenTofu, ArgoCD, Helm)

Benefits

As a freelancer / contractor with us, you will enjoy flexible working hours and the freedom to choose your own projects. Our platform gives you access to exciting projects in various industries and supports you in advancing your career. You'll benefit from competitive pay and a dedicated team to help you with any questions you may have. Work independently and utilise our strong network to achieve your professional goals.
