Principal Site Reliability Engineer

Key Facts

Full time
Senior (5-10 years)
English

Other Skills

  • Problem Solving
  • Analytical Skills
  • Communication
  • Teamwork
  • Time Management
  • Leadership

Job description

Does it sound interesting to work on an open-source platform managing the data and real-time search and inference for some of the largest companies in the world? Would you thrive on keeping large, globally distributed systems reliable, fast, and observable, and on building the practices and tooling that let a small team operate at massive scale? If so, we want you to join our team at Vespa.ai as a Principal Site Reliability Engineer!

About Vespa.ai:
Vespa.ai is a team of passionate builders. We maintain and develop the Apache 2.0 licensed open-source AI search platform Vespa. 

Vespa is a fully featured search engine and vector database. It supports vector search (ANN), lexical search, and structured data search, all in a single query. Integrated machine-learning model inference enables the application of AI to make sense of data in real time. Together with Vespa’s proven scalability and high availability, this empowers developers to create production-ready search applications at any scale and with any combination of features. Our users and customers are #1 in e-commerce, content, and financial services globally, and Vespa is used by companies such as Perplexity, Spotify, Yahoo, Wix, and many more.

In addition to our open-source platform, Vespa.ai develops and runs Vespa Cloud, a robust SaaS offering that allows businesses to harness the power of our technology with ease.

At Vespa.ai, we are extremely focused on automating everything we do to grow fast and maintain high quality. In all roles, we scale through technology, not simply by adding larger teams. We take pride in being small, nimble, and the most productive.


Position overview
At Vespa.ai, we embrace DevOps as a company culture, seeking to solve technical problems with automation and code rather than repetitive manual effort. For our Vespa Cloud production systems, we have had this mindset from day one.

We are seeking a Principal Site Reliability Engineer to join our team and help keep Vespa Cloud reliable, fast, and observable at global scale. This is a senior individual contributor role on the team that operates and improves our production systems. You will also help shape and develop our approach to SRE and DevOps as we grow. We are looking for a strong engineer who earns influence through contributions and has the ambition to take on greater responsibility over time. You will also participate in our 24x7 on-call rotation, approximately every third to fourth week.

At our Trondheim office, we work office-first: you will be based on-site most of the time, with the flexibility to work from home/remotely when needed, as agreed with your manager.


Responsibilities

  • Help ensure the reliability, availability, and performance of Vespa Cloud production systems running globally at scale.
  • Participate in a 24x7 on-call rotation (approximately every 3rd–4th week), lead incident response, and drive blameless postmortems through to durable fixes.
  • Help define and track SLOs/SLIs, and build proactive alerting, capacity planning, and remediation strategies.
  • Design and improve observability — metrics, logging, and tracing — across a large fleet.
  • Eliminate operational toil by solving problems with automation and code rather than manual effort.
  • Contribute to, and help shape, our SRE and DevOps practices and culture as the organization grows, sharing knowledge and mentoring across the team.
  • Work with the rest of the Vespa.ai development team on reliability, scalability, and architecture.

Qualifications

  • 5–10 years building and operating large-scale production systems, with deep SRE/DevOps experience.
  • Solid programming skills in Java, Python, Go, or similar languages.
  • Good understanding of sound software engineering principles and practices.
  • Experience with cloud platforms (AWS, Azure, or GCP).
  • Solid understanding of networking, operating systems, distributed systems, and security principles.
  • Proven incident management and on-call experience.
  • A track record of influencing technical direction and improving how teams work — not just executing tickets.
  • Excellent problem-solving and analytical skills, and the ability to lead through influence as well as work independently.

Desired Skills

  • Experience with Infrastructure as Code tools such as Terraform, OpenTofu, Spacelift, etc.
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry, ELK).
  • Experience with CI/CD tooling such as GitHub Actions, Buildkite, etc.
  • Experience operating data-intensive or stateful systems at scale.
  • Experience defining SLOs and establishing reliability programs.
  • Ambitions beyond pure SRE — an interest in growing, over time, into a technical leadership role.

Some of Our Tools and Services

  • JumpCloud, Google Workspace, and Slack
  • GitHub Enterprise Cloud (including GitHub Actions)
  • Jira Cloud and Jira Service Desk
  • StrongDM, Grafana, Spacelift, and Buildkite
  • AWS, GCP, and Azure

Why Join Us:

  • Opportunities for professional growth and development as part of one of Europe’s most exciting start-ups!
  • Be part of a cutting-edge team working on innovative search and recommendation technology.
  • Work on a team where we don’t believe in silos between engineers; there aren’t “developers”, “ops people”, and “sysadmins”. We’re all engineers solving problems the smart way together!
  • Competitive salary and benefits.


Note: Vespa.ai is an equal-opportunity employer. We are committed to creating an inclusive environment for all employees. We believe in fostering a collaborative and inclusive environment where every team member has the opportunity to make a significant impact.
