Key Facts

Remote From:

United States

Category: Engineering Manager

Full time

Senior (5-10 years)

English

Hard Skills

Site Reliability Engineering, Incident Management, DevOps, Observability, Amazon Web Services, Microsoft Azure, gRPC, Reliability Engineering, Apache Kafka

Other Skills

• Communication

Requirements:

5+ years of experience in site reliability engineering, DevOps, or production operations
Hands-on experience with incident management tooling and observability stacks
Strong fluency with reliability concepts
Proficiency in Go or a comparable systems language

Roles & Responsibilities

Drive process improvements across the incident lifecycle
Coordinate the on-call program across multiple geographies
Select incidents for post-incident review and facilitate reviews
Build AI agents to automate operational toil

Job description

Redpanda is pioneering the Agentic Data Plane (ADP) - a new category in AI infrastructure that makes it simple and secure to connect AI agents with enterprise data and systems. Built on a multi-modal data streaming engine, Redpanda empowers agentic applications that reason and act in real time with speed, autonomy, and precision.

Global leaders including Activision Blizzard, Cisco, Moody's, Texas Instruments, Vodafone and 2 of the top 5 banks in the U.S. rely on Redpanda to process hundreds of terabytes of data a day.

Backed by premier venture investors Lightspeed, GV and Haystack VC, Redpanda is a diverse, people-first organization with teams distributed around the globe.

About the Role:

We're looking for a Staff Production Operations Engineer to drive Redpanda's reliability operations program. This role combines hands-on site reliability engineering with planning and coordination skills to ensure a world-class operations practice across a globally distributed engineering team.

In this role, you'll work with the broader Engineering team, Engineering leadership, Product and Customer Success to drive operational excellence. You'll coordinate our on-call and incident lead rotations, drive blameless post-incident reviews, and own the processes that help us respond faster, learn more from outages, and systematically improve reliability. We're looking for someone who can leverage AI agents to automate the operational toil that slows teams down, building on Redpanda's own ADP platform to do it.

You Will:

Drive process improvements across the incident lifecycle: severity models, triage enforcement, alert noise reduction, and follow-up completion rates
Coordinate the on-call program across multiple geographies: manage schedules and shadow rotations, onboard new engineers, and ensure consistent coverage
Select incidents for post-incident review, facilitate blameless post-incident reviews, document findings, and track follow-up completion. Contribute to addressing incident follow-ups where possible, either by fixing issues directly or prototyping solutions
Build AI agents to automate operational toil, including on-call automation, incident summarization, post-incident review prep, follow-up tracking, and on-call analytics
Maintain runbooks, playbooks, and incident process documentation, and keep them current as processes evolve

You Have:

5+ years of experience in site reliability engineering, DevOps, or production operations in large-scale, highly reliable environments
A track record of leading initiatives end-to-end, from design and planning, to execution and production operation
Hands-on experience with incident management tooling (incident.io, PagerDuty, or similar) and observability stacks (Datadog, Grafana, Sentry, CloudWatch, or equivalent)
Strong fluency with reliability concepts: MTTD, MTTR, MTTA, error budgets, SLOs
Experience building automation and tooling to reduce operational toil
Proficiency in Go (or a comparable systems language, with willingness to ramp up)
Experience with AI-assisted software development workflows including tools like Claude Code
Working knowledge of at least one of AWS / Azure / GCP, including infrastructure as code for system and network infrastructure
Strong written communication; ability to drive alignment across engineering teams without direct authority

Nice to Have:

Hands-on experience building agents or automations using LLMs
Familiarity with Redpanda, Apache Kafka, or other streaming infrastructure
Prior experience in a fast-growing B2B infrastructure or developer tools company

U.S. base salary range for this role is $220,000 - $256,000 (CA, NY, WA) and $211,000 - $250,000 (other US locations). Our salary ranges are determined by role, level, and location. We strive to consider each candidate's job-related skills, location, experience, relevant education or training to determine individual base salary. Your talent partner will share more about the specific salary range for your preferred location during the hiring process.

Please note that Redpanda uses artificial intelligence (AI) technology to assist in the screening and assessment of applications for this position. However, all final hiring decisions are made by our human hiring team.

Vacancy Status: This job posting is for an existing vacancy.

Join Redpanda if you’d enjoy being part of a fast-moving, diverse, people-first organization with team members around the globe and a culture based on trust, transparency, communication, and kindness. You'll dive into a nimble, high-impact team with the latest AI tools — and the budget to actually use them.



