Site Reliability Engineer

extra holidays - fully flexible

Work set-up:

Full Remote

Contract:

Experience:

Mid-level (2-5 years)

Work from:

Offer summary

Qualifications:

Proficiency in monitoring and observability tools like Grafana, Prometheus, or OpenTelemetry.
Experience with incident management tools such as Incident.io.
Strong coding skills in languages like Go or Python.
Knowledge of automation, CI/CD pipelines, and infrastructure as code (e.g., Terraform, Kubernetes).

Key responsibilities:

Design and implement alerting and monitoring systems.
Collaborate with development teams to improve system observability.
Build self-healing and automation tools to enhance system reliability.
Participate in 24/7 support rotation to assist with incident response.

Unitary Startup https://www.unitary.ai/

11 - 50 Employees

Job description

The company

We are a rapidly growing startup developing solutions that blend human expertise and AI agents to handle manual customer and marketplace operations tasks. Our unique approach combines the strengths of human expertise (high accuracy and nuanced decision-making) with the advantages of AI automation (speed and cost efficiency). This cutting-edge technology helps businesses solve real-world challenges in trust & safety and beyond without complex technical integration. We believe in an online world free from harm, where we can trust AI to make safe and fair decisions.
We have raised about $25M in VC funding from top-tier funds including Creandum and Plural, and operate at significant scale, analysing millions of images and videos daily. But we are just at the beginning of our journey and we are very excited about our plans for growth over the coming year and beyond!

The role

We are now looking for a Site Reliability Engineer to ensure our systems run smoothly and reliably at scale. Your expertise in monitoring, observability, and system automation will help maintain the high availability and performance our customers depend on. You will work at the intersection of development and operations, using your technical skills to build robust infrastructure and streamline deployment processes.
Your mission will be to proactively identify and resolve system issues before they impact our customers. You will collaborate closely with development teams to implement monitoring solutions, create comprehensive alerting systems, and develop the tools needed to maintain system reliability. Initially, you will focus on enhancing our existing monitoring and alerting infrastructure, then gradually build self-healing systems and self-service capabilities that empower teams to diagnose and resolve issues independently.
As part of this role, you will:

Design and implement comprehensive alerting systems that detect issues early and provide actionable insights to streamline their resolution.

Collaborate with our development teams to ensure our observability stack provides clear visibility into system health and performance.

Optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams.

Build self-healing systems using AI tools that automatically resolve common issues before they require human intervention.

Develop automation tools and diagnostic capabilities that help teams quickly identify and resolve issues when manual investigation is required.

Ensure secure and reliable code deployment processes through robust CI/CD pipelines and infrastructure automation.

Join our 24/7 support rotation, which provides first-level platform support to ensure a great customer experience.

Requirements
You
We are looking for someone who is excited about building innovative solutions and wants to have a large impact in a smaller company; you will be a key part of defining Unitary’s future during this early stage of our new product strategy. We need versatile people who are happy to get stuck into whatever needs doing, and are ready to learn and grow with the company.
For this particular role, we need a collaborative engineer who excels at working across teams and can translate complex technical concepts into actionable solutions. You should be comfortable balancing your time between fixing urgent issues and investing in proactive system improvements. Communication is crucial, as you'll be working closely with multiple engineers and may need to coordinate during high-stress incident situations.
We would love to hear from you if you:

Have worked with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance

Are proficient with metrics platforms such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data

Have experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions

Can demonstrate strong problem-solving skills and the ability to work autonomously

Are confident writing production code in languages such as Go or Python

Thrive in a collaborative environment where group output and team achievements weigh heavier than individual input

It would be even better, but not essential, if you:

Have experience working in a fully remote, international team

Have previous startup experience

Have built Slack bots or similar automation tools to streamline team workflows

Have experience with CI/CD platforms for building reliable deployment pipelines (e.g. GitLab CI, ArgoCD)

Have worked with Kubernetes and infrastructure as code tools such as Terraform for scalable system deployment

Are familiar with MLOps practices and tools, and monitoring machine learning systems in production

This role can be placed anywhere within 3 hours of the UK time zone.
Benefits
About us
The team
Unitary is a remote-first team of c. 20 people spread across Europe and North America who are fiercely passionate about making the internet a safer place, and deeply motivated to become a force for good. We have an ambition to create a company filled with happy, kind and collaborative people who achieve extraordinary things together. Our culture is built around the power of trust, transparency and self-leadership.
Working at Unitary
We are committed to creating a positive and inclusive culture built on genuine interest in each other's wellbeing. We offer progressive and market-leading benefits, including:

Flexible hours and location

Competitive salary and equity package

Occupational pension

Generous paid parental leave

Generous paid sick leave

Annual budget for your professional development and growth

Annual budget for your individual health and wellness

Three team offsites to London or other exciting destinations in Europe

Required profile

Experience

Level of experience: Mid-level (2-5 years)

Spoken language(s):

English

Hard Skills

Automated Information Systems
Incident Management
Observability
Continuous Monitoring
Production Code
Kubernetes
Prometheus (Software)
Grafana
Telemetry
CI/CD
Infrastructure Automation
InfluxDB
Terraform
MLOps (Machine Learning Operations)

Other Skills

Collaboration
Communication
Problem Solving
