Senior Site Reliability Engineer

Job description

About Us:

Collaborative autonomy is how self-tasking teams of machines will solve hard human problems, and HavocAI is an unquestioned leader in collaborative autonomy. We set the standard for autonomous surface vessels for a wide range of defense and commercial maritime missions. Success requires us to grow quickly, and we’re looking for teammates who are passionate about solving hard problems, about pushing the envelope, and about preventing conflict and saving lives. Ambition is welcome to apply within.

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) with 7+ years of experience designing, operating, and scaling highly reliable distributed systems. In this role, you will be a key technical leader within the Cloud Platform team, responsible for ensuring the availability, performance, and resilience of mission-critical services supporting autonomy, simulation, and data-intensive workloads.

You will work closely with Cloud Platform, DevOps, Data Engineering, and Autonomy teams to establish reliability standards, improve operational maturity, and build systems that scale safely under real-world conditions. The ideal candidate is deeply technical, calm under pressure, and experienced in owning reliability outcomes end-to-end.

Responsibilities

Reliability Engineering & Architecture

  • Design and evolve reliability architecture for distributed and cloud-hosted systems.

  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.

  • Partner with platform and application teams to design systems for reliability, scalability, and operability.

  • Identify and mitigate systemic reliability risks across infrastructure and services.

Operations & Incident Management

  • Lead incident response processes including on-call rotations, escalation, and post-incident reviews.

  • Conduct root cause analysis for complex production incidents and drive long-term improvements.

  • Improve operational readiness through runbooks, automation, and resilience testing.

  • Reduce operational toil through tooling, automation, and process improvements.

Observability & Performance

  • Design and maintain observability systems for metrics, logging, tracing, and alerting.

  • Ensure services and data pipelines are observable, debuggable, and performant in production.

  • Drive performance analysis and tuning across infrastructure and service layers.

Automation & Platform Collaboration

  • Build automation to improve system reliability, deployment safety, and recovery processes.

  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.

  • Support and improve Kubernetes-based environments and containerized workloads.

Security & Resilience

  • Collaborate with security teams to ensure secure and resilient system design.

  • Participate in disaster recovery planning and testing.

  • Maintain strong operational practices around access control, secrets management, and change management.

Requirements

  • 7+ years of experience in SRE, infrastructure, or systems engineering roles.

  • Strong experience operating large-scale distributed production systems.

  • Deep understanding of Linux systems, networking, and distributed systems fundamentals.

  • Hands-on experience with Kubernetes and container orchestration.

  • Programming or scripting experience in Go, Python, or similar languages.

  • Experience designing and operating observability systems for production environments.

  • Proven ability to lead incident response and reliability improvements.

  • Strong communication skills and ability to collaborate across engineering teams.

  • Must be a US Citizen.

  • Must be eligible to obtain a government clearance, if required.

Nice to Have

  • Experience supporting autonomy, robotics, simulation, or real-time systems.

  • Familiarity with AWS and large-scale cloud infrastructure.

  • Experience with chaos engineering, fault injection, or resilience testing.

  • Knowledge of CI/CD systems and progressive delivery practices.

  • Experience working in high-reliability or safety-critical environments.

Benefits:

  • 100% Employer-Paid Health, Dental, and Vision Insurance for you and your family

  • Life Insurance (Employer Paid)

  • Ability to participate in the company's 401(k) program (with matching)

  • Unlimited PTO policy with an enforced 2-week minimum

  • Equity Package

  • Work / Home Office Stipend

  • Global Entry

  • 16 Weeks of Paid Parental Leave

  • Monthly Health and Wellness Stipend


Our Values:

  • Innovation: We are driven to break new ground. Every day presents an opportunity to challenge the status quo, think boldly, and deliver advanced solutions that transform the future of defense technology.

  • Integrity: We hold ourselves to the highest ethical standards, ensuring transparency, accountability, and trust in all our actions and partnerships.

  • Mission-Driven: We are focused on achieving impactful outcomes that align with our core mission—protecting lives through innovation.

  • Forward-Leaning: We continuously seek out new opportunities and remain at the forefront of technological advancements. We embrace change and anticipate the challenges of tomorrow with confidence and creativity.

  • Ownership of All Tasks: At HavocAI, no problem is too complex or too trivial. We believe that greatness comes not only from tackling the hardest challenges but also from handling the smallest, sometimes thankless, tasks with the same level of commitment and care.

  • Servant Leadership: We lead by serving others, whether it’s supporting our employees, partners, or the broader community. Empowering those around us is key to achieving long-term success and making a lasting impact.

HavocAI is an Equal Opportunity Employer and is committed to creating an inclusive and diverse workplace. We welcome applicants from all backgrounds and do not discriminate based on race, color, religion, gender, sexual orientation, age, national origin, disability, veteran status, or any other legally protected status.
