Key Facts

Remote From: United Kingdom

Full time

English

Other Skills

• Calmness Under Pressure
• Accountability
• Collaboration
• Leadership
• Problem Solving

Job description

The Role

We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.

This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.

This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.

About Lyrebird Health

Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird, and that number is growing every day.

They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.

What You'll Do

Reliability & Production Engineering

Own reliability outcomes across core services and customer-facing systems

Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets

Lead initiatives to improve uptime, latency, and overall system resilience

Proactively identify reliability risks and drive mitigation plans to completion

Observability & Incident Response

Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly

Lead incident response for high-severity events and guide teams through calm, effective mitigation

Drive post-incident reviews that result in measurable, lasting improvements

Build a culture of operational excellence: fewer incidents, faster recovery, better learning

Platform Enablement

Develop internal tooling and paved paths that make “doing the right thing” the easiest option

Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership

Partner with engineers to uplift production-readiness across new and existing services

Infrastructure & Automation

Improve infrastructure reliability and maintainability using Infrastructure as Code

Strengthen deployment workflows and reduce operational toil through automation

Help shape architecture decisions with a reliability and scalability lens

Security & Compliance Support

Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)

Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

What We’re Looking For:

8+ years of engineering experience, with significant depth in SRE, platform, or production systems

Strong experience operating and improving systems in production (including incident response)

Proven ability to lead cross-team initiatives and influence engineering standards

Technical Strength

You don’t need to tick every box, but you should be strong across most:

Cloud/Infrastructure

AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)

Infrastructure as Code (Terraform)

Observability

Strong grasp of monitoring and alerting principles

Experience with logs, metrics, and tracing, and with building meaningful dashboards

Familiarity with OpenTelemetry and modern observability tooling

Systems & Operational Excellence

Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning

Strong debugging instincts across distributed systems

Practical approach to risk management and tradeoffs

Software Engineering

Ability to build tools and automation (TypeScript, Go, Python, or similar)

Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)

Bonus Skills (Nice to Have):

Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)

Experience with service mesh patterns, multi-account AWS environments, or multi-region design

Experience working with healthcare or regulated domains

Experience scaling engineering org practices as the company grows

Who You Are:

You’re deeply accountable - you take ownership of outcomes, not just tasks

You value simplicity and reliability over cleverness

You’re calm and effective in incidents, and you raise the quality bar afterward

You communicate clearly across engineering and non-engineering stakeholders

You’re pragmatic: you know when to move fast, and when to slow down to reduce risk

Why This Role Is Different:

Staff-level scope with real influence across engineering

Direct impact on reliability for a product clinicians depend on every day

Work on meaningful problems where security, performance, and trust matter

High ownership environment with room to shape how the company operates at scale

At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.

We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.
