This is a remote position.
Contract opportunity for an Observability & Kubernetes Operations Engineer to optimize platform reliability and manage large-scale cloud infrastructure. In this role, you will take ownership of system visibility and operational readiness by scaling enterprise monitoring tools, streamlining CI/CD pipelines, and maintaining multi-tenant container stability.
Position Type: Contract (1 FTE)
Compensation: Daily rate available
Location: Remote (with occasional onsite visits in Germany)
Language Requirement: Fluent English and German
CI/CD Support & Operational Readiness: Validate deployment artifacts from an operational standpoint, define quality assurance measures, and ensure that robust rollback strategies and observability are in place for production environments.
Platform Operations & Incident Management: Oversee system health, performance metrics, and service availability across multi-tenant environments, ensuring maximum platform stability and minimal service disruption.
Problem Resolution: Rapidly identify, analyse, and resolve platform incidents, triggering detailed root-cause analyses and rolling out long-term preventative actions.
Automation & SRE Implementation: Reduce operational toil by automating recurring standard procedures and validating all code updates through testing and staging lifecycles.
Security & Compliance Enforcement: Implement robust monitoring and logging strategies to meet compliance audit requirements, conduct routine security scans, and remediate platform vulnerabilities.
Kubernetes Platform Operations: At least 3 years of hands-on experience operating self-managed Kubernetes clusters and running production applications in on-premise environments.
Observability & Tool Administration: Hands-on experience with the administration, operation, and consumption of logging and monitoring ecosystems (such as Prometheus, Grafana, Datadog, Mimir, Loki, and OpenTelemetry collectors).
Networking Architecture: Deep understanding of core networking concepts, including enterprise protocols, load balancing, and network security.
CI/CD & GitOps Integration: In-depth knowledge of building continuous integration and delivery processes using modern tooling (such as GitLab, Jenkins, Tekton, Argo Workflows, or Argo CD) alongside relevant security checks.
ITSM & SRE Principles: Solid understanding of core IT Service Management processes (incident, change, and problem management) combined with practical Site Reliability Engineering concepts.
SLO Tracking & Metrics: Proven experience extracting actionable operational insights from platform data, including defining, tracking, and managing SLIs, SLAs, and SLOs.
Technical Documentation: Experience in clearly structuring operational topics, authoring technical documentation, and maintaining actionable team runbooks or playbooks.
Language Skills: Professional fluency in both spoken and written English and German (at least C1 level for both).
Eligibility: Residency and right to work in the EU, EEA, UK, or Switzerland.
