DevOps Lead – Bare-Metal & GPU Infrastructure (Linux)

Work set-up: Full Remote
Experience: Senior (5-10 years)

Offer summary

Qualifications:

  • 5+ years of Linux SRE/DevOps experience managing 100+ bare-metal nodes.
  • Deep knowledge of NVIDIA/AMD GPU servers and high-speed interconnects.
  • Proven ability to maintain ≥ 99.999% uptime in latency-sensitive environments.
  • Expertise in Kubernetes on bare metal, Infrastructure-as-Code, and monitoring tools.

Key responsibilities:

  • Design and automate high-availability architectures for fleet reliability.
  • Build and maintain zero-touch CI/CD pipelines for rapid deployment.
  • Manage bare-metal lifecycle, including firmware, drivers, and GPU tuning.
  • Lead incident response and guide a small DevOps team.

Runware
2 - 10 Employees

Job description

Company Description

Runware is the fastest AI-as-a-Service platform for media generation

Runware is an AI-as-a-Service platform that delivers real-time inference at 5–10× lower cost than competitors. Our platform is purpose-built for speed & efficiency: custom GPU design, server setup, and datacenter architecture matched with performance-optimized software and a best-in-class API. Engineering teams who work with Runware save up to 80% on inference, improve response times, and scale instantly across 300K+ AI models, all through a single flexible API. Usage-based pricing and on-demand capacity are already battle-tested by Wix, OpenArt, NightCafe, Freepik, and thousands more. Backed by Insight Partners, a16z Speedrun, Begin Capital, and Zero Prime.

Join Runware to power the AI products that are changing the world

At Runware you’ll collaborate with the world’s leading AI teams, turning cutting-edge research into breakthrough products for thousands of clients. New models hit the market every week, and our job isn’t just to keep pace; it’s to stay two steps ahead, delivering unbeatable speed and performance every time.

That takes a special kind of teammate: driven, self-directed, lightning-quick to learn, and rock-solid reliable. If you thrive on building ambitious things with people who work hard, care for one another, and refuse to settle for “good enough,” you’ll feel right at home.

Résumés matter, but passion, grit, and proof of excellence matter more, whether you honed your skills in a research lab, at work, or on your own at 2 a.m. If that sounds like you, let’s talk.

About the Role

This is a full-time remote role for a DevOps Lead – Bare-Metal & GPU Infrastructure (Linux). The successful candidate will be responsible for ensuring 99.999% service availability and optimum usage/scale infrastructure ratios while shipping code across hundreds of Linux GPU servers in multiple datacenter locations.

Responsibilities

  • Fleet reliability – design and automate HA architectures that tolerate node, rack, or site failure without user impact.
  • Ultra-fast delivery – build zero-touch CI/CD pipelines (GitOps, progressive rollout, instant rollback) that push config or container changes globally in under 10 minutes.
  • Bare-metal lifecycle – PXE/Redfish/IPMI bootstrapping, firmware & driver orchestration, per-node GPU tuning, automated decommissioning.
  • Kubernetes on metal – multi-cluster control-plane HA, GPU scheduling, CNI overlay (Cilium/Calico), MetalLB/Ingress with <50 ms failover.
  • Observability at scale – end-to-end metrics, logs, traces, actionable SLO dashboards, and predictive auto-healing.
  • Incident command – primary on-call lead; run blameless postmortems and automate root-cause fixes.
  • Capacity bursts – script server bring-up (Ansible/Terraform/ClusterAPI) so 100+ new GPUs go live in minutes.
  • Security & compliance – kernel-level hardening, secrets management, GPU multi-tenancy isolation, continuous CVE patching.
  • Mentorship – guide a small SRE/DevOps pod, set coding standards, and champion best practices.

In Your First 12 Months You Will

  • Cut average deployment latency to ≤ 2 min end-to-end, with one-click rollbacks.
  • Maintain ≤ 5 min total annual user-visible downtime (five nines) across all sites.
  • Automate server bring-up to <10 min from rack power-on to production workload.
  • Reduce P1 incidents by ≥ 60% through predictive alerting and auto-remediation.
  • Deliver fully auditable, Git-centric change pipelines adopted by 100% of engineering.

Requirements

  • 5+ yrs Linux SRE/DevOps with 100+ bare-metal node fleets; 2+ yrs as technical lead.
  • Deep knowledge of NVIDIA/AMD GPU servers, high-speed interconnects (40 GbE+/InfiniBand/RoCE), NVMe/RDMA storage.
  • Proven record sustaining ≥ 99.999% uptime in latency-sensitive, high-variance demand environments.
  • Expert in Kubernetes on bare metal (ClusterAPI, KubeVirt, GPU Operator), advanced CNI, custom schedulers, and etcd care-and-feeding.
  • Strong skills in Go or Python, plus Bash; you write the tools you can’t find.
  • Infrastructure-as-Code mastery (Terraform, Ansible, Packer), GitOps workflows, and container build systems.
  • Monitoring/alerting stacks (Grafana), chaos/latency testing, synthetic probes.
  • Clear architectural thinking, crisp documentation, and calm communication under pressure.

Ready to architect zero-downtime, sub-minute rollouts for thousands of GPUs? Apply and let’s run the world’s AI together.

Benefits

We’re a remote-first collective, meeting in person twice a year to plan, brainstorm, celebrate wins, and enjoy some face-to-face time. We have core hours for cooperative working and calls, but outside of that your calendar is yours. Work the hours that let you perform at your peak while also building a healthy life.

Our release cycles are fast and intense, but they’re followed by real downtime. After big pushes we expect the team to unplug, recharge, and come back ready and stronger than ever for the next leap.

  • Generous paid time off – vacation, sick days, public holidays
  • Meaningful stock options – share in the upside you create
  • Remote-first setup – work from home anywhere we can employ you
  • Flexible hours – own your schedule outside core collaboration blocks
  • Family leave – paid maternity, paternity, and caregiver time
  • Company retreats – twice-yearly gatherings in inspiring locations

Required profile

Experience

Level of experience: Senior (5-10 years)
Spoken language(s): English

Other Skills

  • Teamwork
  • Communication
  • Problem Solving
