Offer summary

Qualifications:

5+ years of Linux SRE/DevOps experience managing 100+ bare-metal nodes.

Deep knowledge of NVIDIA/AMD GPU servers and high-speed interconnects.

Proven track record of maintaining ≥ 99.999% uptime in latency-sensitive environments.

Strong skills in Go, Python, Bash, and Infrastructure-as-Code tools like Terraform and Ansible.

Key responsibilities:

Design and automate high-availability architectures for fleet reliability.

Build zero-touch CI/CD pipelines for rapid deployment and rollback.

Manage bare-metal lifecycle including firmware, drivers, and GPU tuning.

Lead incident response, automate root-cause analysis, and mentor the DevOps team.

Job description

Company Description

Runware is the fastest AI-as-a-Service platform for media generation

Runware is an AI-as-a-Service platform that delivers real-time inference at 5–10× lower cost than competitors. Our platform is purpose-built for speed & efficiency: custom GPU design, server setup, and datacenter architecture matched with performance-optimized software and a best-in-class API. Engineering teams who work with Runware save up to 80% on inference, improve response times, and scale instantly across 300K+ AI models, all through a single flexible API. Usage-based pricing and on-demand capacity are already battle-tested by Wix, OpenArt, NightCafe, Freepik, and thousands more. Backed by Insight Partners, a16z Speedrun, Begin Capital, and Zero Prime.

Join Runware to power the AI products that are changing the world

At Runware you’ll collaborate with the world’s leading AI teams, turning cutting-edge research into breakthrough products for thousands of clients. New models hit the market every week, and our job isn’t just to keep pace—it’s to stay two steps ahead, delivering unbeatable speed and performance every time.

That takes a special kind of teammate: driven, self-directed, lightning-quick to learn, and rock-solid reliable. If you thrive on building ambitious things with people who work hard, care for one another, and refuse to settle for “good enough,” you’ll feel right at home.

Résumés matter, but passion, grit, and proof of excellence matter more, whether you honed your skills in a research lab, at work, or on your own at 2 a.m. If that sounds like you, let’s talk.

About the Role

This is a full-time remote role for a DevOps Lead – Bare-Metal & GPU Infrastructure (Linux). The successful candidate will be responsible for ensuring 99.999% service availability and efficient infrastructure utilization at scale while shipping code across hundreds of Linux GPU servers in multiple data-center locations.

Responsibilities

Fleet reliability – design and automate HA architectures that tolerate node, rack, or site failure without user impact.
Ultra-fast delivery – build zero-touch CI/CD pipelines (GitOps, progressive rollout, instant rollback) that push config or container changes globally in under 10 minutes.
Bare-metal lifecycle – PXE/Redfish/IPMI bootstrapping, firmware & driver orchestration, per-node GPU tuning, automated de-commissioning.
Kubernetes on metal – multi-cluster control-plane HA, GPU scheduling, CNI overlay (Cilium/Calico), MetalLB/Ingress → <50 ms failover.
Observability at scale – end-to-end metrics, logs, traces, actionable SLO dashboards, and predictive auto-healing.
Incident command – primary on-call lead; run blameless post-mortems and automate root-cause fixes.
Capacity bursts – script server bring-up (Ansible/Terraform/Cluster API) so 100+ new GPUs go live in minutes.
Security & compliance – kernel-level hardening, secrets management, GPU multi-tenancy isolation, continuous CVE patching.
Mentorship – guide a small SRE/DevOps pod, set coding standards, and champion best practices.

In Your First 12 Months You Will:

Cut average deployment latency to ≤ 2 min end-to-end, with one-click rollbacks.
Maintain ≤ 5 min total annual user-visible downtime (five nines) across all sites.
Automate server bring-up to <10 min from rack power-on to production workload.
Reduce P1 incidents by ≥ 60% through predictive alerting and auto-remediation.
Deliver fully auditable, Git-centric change pipelines adopted by 100% of engineering.

Requirements

5+ yrs Linux SRE/DevOps with 100+ bare-metal node fleets; 2+ yrs as technical lead.
Deep knowledge of NVIDIA/AMD GPU servers, high-speed interconnects (40 GbE+/InfiniBand/RoCE), NVMe/RDMA storage.
Proven record sustaining ≥ 99.999% uptime in latency-sensitive, high-variance demand environments.
Expert in Kubernetes on bare metal (Cluster API, KubeVirt, GPU Operator), advanced CNI, custom schedulers, and etcd care-and-feeding.
Strong skills in Go or Python, plus Bash; you write the tools you can’t find.
Infrastructure-as-Code mastery (Terraform, Ansible, Packer), GitOps workflows, and container build systems.
Monitoring/alerting stacks (Grafana), chaos/latency testing, synthetic probes.
Clear architectural thinking, crisp documentation, and calm communication under pressure.

Ready to architect zero-downtime, sub-minute rollouts for thousands of GPUs? Apply and let’s run the world’s AI together.

Benefits

We’re a remote-first collective, meeting in person twice a year to plan, brainstorm, celebrate wins, and enjoy some face-to-face time. We have core hours for cooperative working and calls, but outside of that your calendar is yours. Work the hours that let you perform at your peak while also building a healthy life.

Our release cycles are fast and intense, but they’re followed by real downtime. After big pushes we expect the team to unplug, recharge, and come back stronger than ever for the next leap.