Key Facts

Full time

Senior (5-10 years)

English

Hard Skills

Root Cause Analysis, Incident Management, Datadog, Ceph (Software), Linux, Prometheus (Software), Site Reliability Engineering, VMware, Virtualization, Capacity Planning

Other Skills

• Troubleshooting (Problem Solving)
• Data Reporting
• People Management
• Accountability
• Teamwork
• Problem Solving

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
7+ years of experience in systems operations, site reliability, or platform engineering, including at least 2 years leading teams or major operational functions
Proven incident management experience in a 24/7 production environment with strong troubleshooting and root cause analysis skills
Experience with change management practices, monitoring and observability tooling (e.g., Datadog, Prometheus, Grafana, New Relic), and incident management tools (e.g., PagerDuty, Opsgenie); familiarity with Linux, VMware, Ceph, and cloud platforms

Roles & Responsibilities

Own and continuously evolve Reliability Operations by leading incident command and standardizing incident declaration, severity, escalation, and cross-team communications; manage 24/7 incident response and on-call rotations
Drive learning and accountability by leading post-incident reviews with strong root cause analysis and documentation; translate incident trends into actionable reliability improvements and ensure corrective actions are completed
Define and maintain service performance and reliability targets, own the observability strategy (monitoring, alerting, signal quality), and improve detection and time to resolution; align reliability with platform roadmaps
Collaborate across Engineering, Security, and Operations to ensure reliability across virtualization, storage, Linux, and hybrid cloud environments; provide regular, data-driven reporting to leadership on availability and incident trends; scale practices across multiple products and brands

Nexcess

About Nexcess

Nexcess is the best place to build your business online. Optimized for your hosting and solution needs, we provide a managed hosting infrastructure, curated tools, and a team of experts that make it easy to build, manage, and grow your business online. Serving SMBs and the designers, developers, and agencies who create for them for more than 22 years, we provide a fully managed, high-performance cloud solution built to optimize WordPress, WooCommerce, and Magento sites and stores. Nexcess holds data centers worldwide that deliver performance, reliability, auto-scaling, and management control through our best-in-class open stack cloud platform. As a point of pride, Magento was invented on Nexcess servers. Nexcess is a brand within the CloudOne Digital portfolio of cloud capabilities owned by One Equity Partners. CloudOne Digital is an innovative portfolio of cloud-based solutions focused on the needs of online businesses. It offers best-in-class infrastructure and cloud capabilities spanning the needs of small entrepreneurs, small and midsize businesses, and midmarket enterprise workloads, all with the support online businesses need to grow and succeed. For more information, visit cloudonedigital.com.

Founded: 2018

Company size: 51 - 200

Job description

About Nexcess

Nexcess brings together a portfolio of hosting, cloud, and digital experience brands to deliver high-performance infrastructure and services to businesses worldwide.

Our platforms power mission-critical applications for thousands of customers. Reliability is foundational to everything we do. We operate complex environments spanning virtualization, storage, networking, and application hosting, where performance, availability, and consistency matter at scale.

This is a permanent, full-time, remote position.

US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.

About the Role

We’re looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.

This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You’ll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.

What You’ll Do

Own Reliability Operations & Incident Command

Continuously evolve and improve incident management, change management, and post-incident practices
Establish clear standards for incident declaration, severity, escalation, and communication
Ensure consistent execution across teams and continuous process improvement

Own the incident command function, including roles, structure, and operating procedures
Lead or oversee major incident response in a 24/7 production environment
Build and manage on-call incident commander rotations with global coverage

Drive Learning, Accountability & Reliability Strategy

Own post-incident reviews, ensuring strong root cause analysis and clear documentation
Translate incident trends into actionable reliability improvements
Drive completion of corrective actions across teams; escalate when needed
Define and maintain service performance and reliability targets (availability, latency, error rates)

Own observability strategy, including monitoring, alerting, and signal quality
Improve detection, reduce time to resolution, and increase platform resilience

Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
Ensure reliability insights directly inform platform and infrastructure roadmaps
Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment

Operate Across a Complex Platform Environment

Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
Ensure reliability practices scale across multiple products, brands, and customer environments

Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
Act as the central authority on reliability insights across teams

What You Bring

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
7+ years of experience in systems operations, site reliability, or platform engineering
2+ years of experience leading teams or major operational functions
Proven experience managing incidents in a 24/7 production environment
Strong background in troubleshooting, root cause analysis, and operational improvement
Experience with change management practices

Platform & Tooling Experience

Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
Logging and telemetry systems (centralized logging, metrics, tracing)
Ability to translate complex technical data into clear insights
Strong communication skills, especially in high-pressure situations

Nice to Have

Background in Computer Science, Engineering, or a related field
Experience in managed hosting, cloud infrastructure, or SaaS environments
Experience defining and tracking system reliability and performance targets
Familiarity with ITIL or similar operational frameworks
Exposure to VMware, Ceph, Linux, and Windows platforms
Relevant certifications (AWS, RHCE, etc.)

We Offer:

Traditional and Roth 401k with company matching
A collaborative team culture
Consistent/set work hours
Challenging non-redundant daily duties
A voice in how things get done

Disclaimer:

This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Liquid Web Inc. reserves the right to modify, interpret, or apply this job description in a way that best supports the organizational needs. The job description in no way creates or implies an employment contract. The employment contract remains “at will”.

Equal Employment Opportunity Policy: Liquid Web is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.
