Senior Site Reliability Engineer

Key Facts

Remote From: 
Full time
Senior (5-10 years)
English

Other Skills

  • Success Driven
  • Collaboration
  • Adaptability
  • Communication
  • Open Mindset
  • Teamwork
  • Problem Solving

Requirements

  • Minimum of 3 years of professional experience in a senior SRE role
  • Deep expertise in at least one major cloud provider (GCP, AWS, or Azure)
  • Extensive hands-on experience with Kubernetes (production-grade clusters, container orchestration, service mesh, and scaling)
  • Proficiency in Python or NodeJS for automation, tooling, and building CI/CD pipelines (GitOps familiarity preferred)

Roles & Responsibilities

  • Design, build, and maintain scalable, highly available, and secure systems on GCP; define and track SLIs/SLOs and manage the error budget
  • Automate operational tasks and infrastructure management, build CI/CD pipelines, and implement IaC to reduce toil
  • Apply AI/ML to enhance observability and incident response, including AI-powered anomaly detection, predictive analytics, and automated root cause analysis; conduct blameless post-mortems
  • Manage cloud infrastructure with Kubernetes and IaC, implement comprehensive observability (monitoring, alerting, logging), and participate in on-call incident management

Job description

ABOUT US 🚀

Our vision is to be the world’s leading social travel platform.

We are on a mission to help travellers find their people and create unforgettable moments together. Connections matter as much as destinations, and we are building a global ecosystem to enable them.

Since the launch of our social network in the hostelling category in 2022, Hostelworld has experienced transformational growth. We have successfully welcomed over 2.6 million social members to our platform, proving the strong demand for connection in travel. This vibrant community is not just booking trips; they are actively engaging, co-creating, and becoming our most powerful brand advocates. The platform is rich with user-generated content, including a rapidly growing volume of customer testimonials and authentic travel stories that provide powerful social proof.

The growth in engagement is remarkable: the number of messages sent between travellers has grown significantly faster than the number of trips booked by social members, demonstrating the deep social utility of our platform. This momentum is reflected in our robust financial health, underscored by a strong balance sheet powered by an asset-light, cash-generative business model.

We are not just another Online Travel Agent (OTA); we have created a new category of travel altogether: Social Travel. Our singular focus on helping travellers find people to hang out with is the foundation of our strategy. It has allowed us to build a powerful and defensible market position, which attracts highly valuable customers and shifts more of our business onto our mobile-native apps.

While our app and social features create a sticky user experience, our true competitive moat stems from the incredibly rich, proprietary data set generated by our social network. As our community grows, so does the value of this data, creating a compounding network effect that is nearly impossible to replicate. This "social flywheel" allows us to understand traveller behaviour, predict needs, and personalise experiences in ways that generalist OTAs cannot match. It is the engine that will power our long-term, differentiated growth and solidify our position as the sole player in the Social Travel category we created.

At our Capital Markets Day in April 2025, we unveiled our ambitious strategy to build on our success and realise our vision over the next three years. This strategy is organised around five key themes:

  • Strengthening the Core OTA Platform: We will continue to enhance our core booking engine, ensuring a seamless and efficient experience for our customers and hostel partners, further solidifying our position as the leading platform for hostel discovery and booking.

  • Building 'Pre-Booking' Social Features: We are developing innovative features like 'Travel Plans', designed to engage users much earlier in their travel journey. These features will accelerate the creation of social profiles and drive app downloads, fuelling our social flywheel before a single bed is booked.

  • Leveraging AI for an Enhanced Social Experience: We will embed AI into our pre- and post-booking features to deliver unparalleled personalisation and value. This includes AI-powered recommendations for people to meet, places to go, and things to do, making our social network an indispensable travel companion.

  • Expanding our Addressable Market: We will expand our reach and product portfolio through targeted partnerships and M&A. This will allow us to cater to a broader segment of the youth travel market and integrate new, complementary services into our social travel ecosystem.

  • Building New Revenue Streams: Our social network is more than an engagement tool; it is a new platform for monetisation. We will develop new revenue streams directly from our social features, creating value for our users and our business in ways that traditional OTAs cannot.

Our Culture

At Hostelworld, our culture is a direct reflection of our customers: adventurous, curious, and social. We have a shared love of travel that fuels our work and connects us on a deeper level. It’s a fast-paced environment that blends the agility and "scrappy" resourcefulness of a start-up with the experience and ambition of a global, publicly listed company.

We are a team of pragmatic optimists who are data-obsessed and results-driven. We value a ‘test and learn’ mindset, encouraging experimentation and empowering our teams to take calculated risks. We believe in doing the right thing—for our customers, our partners, and each other. We foster a supportive and collaborative atmosphere where diverse perspectives are celebrated, and where every team member has the opportunity to make a significant impact on our journey. We embrace the journey, not just the destination, and we seek leaders who will thrive in a dynamic environment and inspire their teams to do the same.

LOCATION 🌍

This role is based in Portugal. We have an office hub in Porto available for those who prefer a hybrid model where you can spend time with colleagues in-person. You will need to be able to commute to our office hub as required from time to time.

WHO YOU'LL WORK WITH 👨🏽‍🤝‍👨🏼

This role is a hybrid of software engineering and systems administration, where you will use your coding skills to automate operational tasks, manage infrastructure as code, and reduce manual toil. Our mission is to ensure the reliability, scalability, and performance of our systems, with a specific focus on our Google Cloud Platform (GCP) infrastructure.

You will be a key contributor to our blameless post-mortem culture and will be part of an on-call rotation to respond to and resolve critical incidents.

WHAT YOU'LL DO 👩‍💻

  • System Reliability and Performance: Design, build, and maintain scalable, highly available, and secure systems on GCP. Define and track key metrics such as Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the error budget to balance reliability and innovation.

  • Automation and Toil Reduction: This is a core function of the role. You will be responsible for creating sustainable systems and services through automation. You will develop and implement automation for routine operational tasks, infrastructure management, and CI/CD pipelines to improve efficiency and reduce human error.

  • AI Integration: Leverage AI and machine learning tools to enhance SRE practices. This includes using AI-powered observability to analyse large volumes of data and detect subtle anomalies and patterns that human observers would miss, as well as using AI for predictive analytics to anticipate potential issues before they impact users. You will also use AI tools to automate root cause analysis and streamline incident response.

  • Cloud Infrastructure Management: Use Infrastructure as Code (IaC) principles to manage our GCP environment, including container orchestration platforms like Kubernetes.

  • Incident Management and Response: Participate in the on-call rotation to respond to and resolve critical incidents. Conduct blameless post-mortems and implement long-term fixes to prevent recurrence.

  • Observability: Implement and maintain comprehensive monitoring, alerting, and logging solutions to provide real-time visibility into system health and performance. You will build monitoring that alerts on user-facing symptoms rather than on underlying causes.
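
For candidates less familiar with the term, the error budget mentioned above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLI; the function names, the 99.9% target, and the request counts are hypothetical examples, not Hostelworld's actual tooling or objectives.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent in the measurement window.

    The budget is the share of requests the SLO allows to fail:
    (1 - slo_target) * total_requests. A negative result means the
    budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    sli = availability_sli(1_000_000, 250)
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"SLI: {sli:.5f}, error budget remaining: {remaining:.0%}")
```

In this hypothetical window the SLO allows about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget unspent. A team in that position could reasonably keep shipping changes; a budget near or below zero would argue for prioritising reliability work.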

WHAT WE’RE LOOKING FOR 👀

  • Experience: A minimum of 3 years of professional experience in a senior technical SRE role is required.

  • Cloud Ecosystems: Deep expertise in at least one major cloud provider (GCP, AWS, or Azure). While our stack is GCP-native, we value candidates who understand core cloud-agnostic architecture and can translate their skills across platforms.

  • Kubernetes Mastery: Extensive hands-on experience with Kubernetes (K8s) is a core requirement. You should be comfortable managing production-grade clusters, including container orchestration, service mesh, and scaling strategies.

  • Programming and Scripting: Proficiency in Python or NodeJS is required for automation, tool development, and general scripting.

  • Data Persistence: Practical experience managing and scaling both relational and NoSQL databases, such as MariaDB/MySQL and MongoDB.

  • CI/CD: Experience with CI/CD tools and building automated pipelines. Familiarity with GitOps workflows and tools is preferred.

  • Operational Mindset: A proactive, "break-it-to-fix-it" approach. You are passionate about observability (SLIs/SLOs), incident response, and performing blameless post-mortems to prevent future issues.

  • Communication: The ability to articulate complex technical issues and solutions to both technical and non-technical stakeholders.

WHAT WE OFFER 💯

  • Competitive salary & benefits 

  • Enhanced annual leave plus 3 Wellbeing Days per year 

  • Paid family leave (maternity, paternity, surrogacy & adoption) 

  • Agile working (plus a Working from Abroad Policy!) 

  • Support for your ongoing growth & development 

  • Inclusive people policies (sickness, menopause, compassionate and fertility leave) 

  • A chance to give back to your local community with 5 paid volunteering days

OUR BEHAVIOURS 🏆

  • Grow others - We fundamentally believe that investing in growing others benefits everyone, whether it's helping them develop hard or soft skills. We want learning and growing to be part of our DNA to help make us a better team, together.

  • Master it - We are obsessed with our area of expertise and enjoy developing our skills. We rarely take things at face value; we investigate, interrogate, and always look for ‘the why,’ and wherever possible, we use data to find the best solution.

  • Collaborate - We are in it together, for the tough stuff and the celebrations too. To achieve the best results, we need expertise from all areas of the organisation, and we wholeheartedly welcome diverse thinking.

  • Adapt - We work fluidly, adapting to new information and the evolving environment while staying committed to our goals. Innovation and experimentation fuel our projects and we’re never afraid to pivot.

  • Deliver - Our focus is always on the end result; we value outcomes over activity. We collaborate to deliver work at speed without dropping any of our other behaviours.

We believe in talented and diverse teams that reflect the diversity of our customers and the communities in which we operate. Everyone brings different perspectives and experiences. We lay out the above requirements to guide applicants to the experiences that we believe will allow you to be successful in the role. If you don’t meet them all, please consider applying if you think you can still perform the role as described.
