The Role
We’re looking for a Staff Site Reliability Engineer (SRE) to raise the reliability, scalability, and security bar across the Lyrebird platform.
This is a senior, high-impact role focused on designing and evolving the systems and practices that keep Lyrebird fast, safe, and available. You’ll work across infrastructure, application reliability, observability, incident response, and platform enablement - partnering closely with Engineering, Security, and Product.
This is not a “keep the lights on” role. You’ll drive meaningful improvements to how we build, deploy, and operate our services in production - with real autonomy and ownership.
About Lyrebird Health
Lyrebird Health is transforming the quality and accessibility of healthcare by automating clinicians’ most time-consuming tasks. Thousands of clinicians across many disciplines already use Lyrebird, and that number is growing every day.
They trust us to deliver a fast, reliable, and secure experience. We value that trust above all else and strive to earn it while continuing to amaze our users.
What You'll Do

Reliability & Production Engineering
- Own reliability outcomes across core services and customer-facing systems
- Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
- Lead initiatives to improve uptime, latency, and overall system resilience
- Proactively identify reliability risks and drive mitigation plans to completion

Observability & Incident Response
- Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
- Lead incident response for high-severity events and guide teams through calm, effective mitigation
- Drive post-incident reviews that result in measurable, lasting improvements
- Build a culture of operational excellence: fewer incidents, faster recovery, better learning

Platform Enablement
- Develop internal tooling and paved paths that make “doing the right thing” the easiest option
- Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
- Partner with engineers to uplift production-readiness across new and existing services

Infrastructure & Automation
- Improve infrastructure reliability and maintainability using Infrastructure as Code
- Strengthen deployment workflows and reduce operational toil through automation
- Help shape architecture decisions with a reliability and scalability lens

Security & Compliance Support
- Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
- Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

What We’re Looking For:
- 8+ years of engineering experience, with significant depth in SRE, platform, or production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards

Technical Strength
You don’t need to tick every box, but you should be strong across most:

Cloud/Infrastructure
- AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
- Infrastructure as Code (Terraform)

Observability
- Strong grasp of monitoring and alerting principles
- Experience with logs, metrics, and tracing, and with building meaningful dashboards
- Familiarity with OpenTelemetry and modern observability tooling

Systems & Operational Excellence
- Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
- Strong debugging instincts across distributed systems
- Practical approach to risk management and tradeoffs

Software Engineering
- Ability to build tools and automation (TypeScript, Go, Python, or similar)
- Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)

Bonus Skills (Nice to Have):
- Experience supporting security frameworks (SOC 2, ISO 27001, HIPAA-style environments)
- Experience with service mesh patterns, multi-account AWS environments, or multi-region design
- Experience working with healthcare or regulated domains
- Experience scaling engineering org practices as the company grows

Who You Are:
- You’re deeply accountable - you take ownership of outcomes, not just tasks
- You value simplicity and reliability over cleverness
- You’re calm and effective in incidents, and you raise the quality bar afterward
- You communicate clearly across engineering and non-engineering stakeholders
- You’re pragmatic: you know when to move fast, and when to slow down to reduce risk

Why This Role Is Different:
- Staff-level scope with real influence across engineering
- Direct impact on reliability for a product clinicians depend on every day
- Work on meaningful problems where security, performance, and trust matter
- High ownership environment with room to shape how the company operates at scale

At Lyrebird, you won’t just respond to incidents - you’ll design the systems and standards that prevent them.
We’re building a team that reflects the diversity of the people who’ll benefit from our work. If you’re from an underrepresented background in tech, we especially encourage you to apply - even if you don’t meet every single requirement.