At Meraki SRE, we build the highly scalable cloud infrastructure that supports millions of Meraki devices worldwide. Meraki's customer base has grown by a factor of 2-3 every year, and we now serve more than 8 billion HTTP requests per day across 10 data centers! Our customers depend on our products to run their critical network infrastructure: switches (now including Cisco Catalyst in addition to Meraki switches), security appliances, wireless APs, security cameras, and sensors. In addition, Meraki offers a range of SaaS solutions on its Dashboard that greatly improve the insight and experience of IT departments around the globe.
The Infrastructure SRE team is responsible for the compute, storage and security underpinning Meraki's cloud in 10 data centers worldwide. Meraki's high growth rate means our processes must be automatic and efficient, never driven manually. Automation, monitoring and a keen eye for technical debt are key.
As a member of the team, you will craft and develop the global infrastructure that supports our cloud; this might mean deploying new infrastructure management technologies at scale, writing code and using workflow orchestrators to improve our provisioning and decommissioning processes, or building models to predict business demand. You will also work closely with our vendors and our internal Datacenter Operations team. We follow the *nix philosophy (build large systems out of small components, each of which does one job and does it well; we run Debian and Ubuntu), automate tedious tasks, and work almost entirely with infrastructure-as-code.
This role supports a specific customer and includes a 24/7 on-call requirement as part of a rotation. You will work with your team to deliver technical projects that support the wider business while spending a portion of your time working cross-team to support this critical customer.
Projects include:
- Deploying and running IaaS solutions that let teams run seamlessly between our private cloud and AWS.
- Using and running a workflow orchestrator (e.g. Luigi, Apache Airflow, Argo) to continuously reduce the manual work required to run our infrastructure.
- Security hardening for services, packages, and processes, including OS images.
- Building an automated platform to manage the full service lifecycle of all infrastructure (server, storage, network, and site).
- Deploying comprehensive monitoring tools to provide insight into the performance and reliability of our infrastructure.
- Automating testing infrastructure to accelerate the velocity at which we can deploy changes.
You are an ideal candidate if you:
- Enjoy and have experience leading large technical projects, particularly involving cloud systems, networking, distributed systems, or data processing frameworks (ETL pipelines).
- Have at least 4 years of experience.
- Have experience running and/or developing highly scalable IaaS solutions.
- Have experience with automation tools such as Ansible and Terraform, as well as container technologies (Docker or similar).
- Script or code in one or two languages such as Ruby, Scala, or Python; enjoy digging into other people's source code (even if you don't know the language) in search of the root cause of a problem; and instinctively write code to deploy and automate infrastructure.
Bonus points for:
- Exciting personal projects or contributions to open source.
- Experience working with ETL pipelines.
- Experience deploying HA services on Kubernetes.
- Experience working with Celery.