Job description
Your role
We are a leading SaaS provider in our field and operate a large number of customer environments in our cloud. To ensure that these environments run smoothly, we are looking for you! Together with your colleagues on the Platform & Infrastructure team, you will be responsible for the scalability, security, and efficiency of our cloud infrastructure, covering AWS, Kubernetes, observability, and the internal developer platform that our product teams build on. We ship continuously, averaging 150 deployments to production every month, and your work is what keeps that pace fast, safe, and sustainable. You will actively contribute to shaping our technical roadmap, mentor the engineers around you, and help cultivate a progressive learning environment.
Your mission
Own and evolve our Kubernetes platform across multiple clusters: manage Helm chart deployments via an OCI registry hosting 40+ charts, enforce policy-as-code with Kyverno, and operate GitOps workflows through Argo CD ApplicationSets with progressive delivery orchestrated by Kargo
Drive technical designs for platform initiatives: scope the problem, propose multiple solutions, assess their trade-offs, and defend your recommendation — then see it through to production
Harden the platform's security posture: workload identity via OIDC, runtime security, image scanning, and secrets management
Write and maintain custom Kubernetes operators and internal tooling in Go and Python that multiply the team's leverage across clusters — we run Zalando postgres-operator alongside our own operators
Maintain and improve our observability stack (Prometheus, Grafana, Thanos, OpenSearch): build dashboards and alerts that give product teams real visibility into their services
Keep GitLab CI/CD pipelines fast and reliable — they power ~150 production deployments per month; execute cross-team changes (rollouts, database migrations, certificate rotations) with care and clear communication
Operate and evolve AWS infrastructure (EKS, VPC, IAM, RDS, S3) including dedicated customer environments in their own AWS accounts; drive cost-efficiency initiatives tracked via OpenCost
Own incidents end-to-end: from alert to fix to postmortem
Raise the team's bar through thorough code and architecture reviews, mentor less experienced engineers, and help us assess technical candidates in interviews
Be the infrastructure partner product teams come to when things are unclear or broken
Participate in a shared on-call rotation
Your profile
Must-have
5+ years of professional engineering experience, including 3+ years in infrastructure, platform, or site reliability engineering
Deep hands-on experience with Kubernetes: cluster operations, workload management, troubleshooting at scale
Helm chart authoring: writing, packaging, and maintaining charts — not just consuming them
GitOps experience with Argo CD or an equivalent tool (Flux, etc.)
Working knowledge of AWS (EKS and supporting services such as IAM, VPC, RDS, S3)
Experience with Infrastructure as Code (Terraform or equivalent)
Proficiency in Go or Python — we write custom operators and internal tooling in both
Experience owning production incidents end-to-end — response, mitigation, and postmortem
Strong English communication skills, with the ability to explain technical decisions to both engineers and non-technical stakeholders
Nice-to-have
Kubernetes operator development experience (Kubebuilder or similar)
Progressive delivery tooling (Kargo, Argo Rollouts, or equivalent)
Policy-as-code experience (Kyverno or OPA)
Runtime security tooling (Falco or equivalent)
Familiarity with observability tooling (Grafana, OpenSearch/Elasticsearch, or equivalent)
Experience with CI/CD pipelines (GitLab CI preferred)
Experience mentoring engineers or participating in technical hiring
German language skills (helpful for customer site interactions)
On-Call
You'll participate in an on-call rotation. Incidents require fast response and clear communication. On-call burden is tracked and addressed through rotation and capacity planning.
Perks & Benefits
Join our team and simply be yourself – we celebrate diversity! Our entrepreneurial culture is driven by a talented and passionate team. We offer full flexibility to suit your needs, whether you prefer working remotely, in the office, or in a hybrid setup. And if you'd like to work from abroad, you can do so for up to 180 days. But that's not all! We also provide a competitive compensation package and 30 days of paid vacation to help you recharge. And with a generous tech budget, you can choose the hardware and software that suits you best.
Flexible working arrangements (remote, office, or hybrid)
Modern office in the heart of Hanover for hybrid work
Up to 180 days (6 months) of remote work from abroad
Competitive compensation and benefits package
30 days (6 weeks) of paid vacation
Modern hardware and software solutions
About our principles
Ambitious, promising, and always on the move: that's SYNAOS. We are a team of around 100 people from more than 15 nations, ranked as a top employer among 2,500+ German startups. We stand together and work toward the same goal: to revolutionize an entire industry. Our platform runs production-critical operations 24/7 for customers like Volkswagen, and we ship to production around 150 times a month. We also take security seriously: we are ISO 27001 certified and a TISAX participant. Can you identify with our principles?