DevOps Lead – Bare-Metal & GPU Infrastructure (Linux)

Work set-up: Full Remote

Experience: Senior (5-10 years)

Offer summary

Qualifications:

5+ years of Linux SRE/DevOps experience managing 100+ bare-metal nodes.
Deep knowledge of NVIDIA/AMD GPU servers and high-speed interconnects.
Proven ability to maintain ≥ 99.999% uptime in latency-sensitive environments.
Expertise in Kubernetes on bare metal, Infrastructure-as-Code, and monitoring tools.

Key responsibilities:

Design and automate high-availability architectures for fleet reliability.
Build and maintain zero-touch CI/CD pipelines for rapid deployment.
Manage bare-metal lifecycle, including firmware, drivers, and GPU tuning.
Lead incident response and guide a small DevOps team.

Runware

2 - 10 Employees

Job description

Company Description

Runware is the fastest AI-as-a-Service platform for media generation
Runware is an AI-as-a-Service platform that delivers real-time inference at 5–10× lower cost than competitors. Our platform is purpose-built for speed & efficiency: custom GPU design, server setup, and datacenter architecture matched with performance-optimized software and a best-in-class API. Engineering teams who work with Runware save up to 80% on inference, improve response times, and scale instantly across 300K+ AI models, all through a single flexible API. Usage-based pricing and on-demand capacity are already battle-tested by Wix, OpenArt, NightCafe, Freepik, and thousands more. Backed by Insight Partners, a16z Speedrun, Begin Capital, and Zero Prime.
Join Runware to power the AI products that are changing the world
At Runware you’ll collaborate with the world’s leading AI teams, turning cutting-edge research into breakthrough products for thousands of clients. New models hit the market every week, and our job isn’t just to keep pace; it’s to stay two steps ahead, delivering unbeatable speed and performance every time.
That takes a special kind of teammate: driven, self-directed, lightning-quick to learn, and rock-solid reliable. If you thrive on building ambitious things with people who work hard, care for one another, and refuse to settle for “good enough,” you’ll feel right at home.
Résumés matter, but passion, grit, and proof of excellence matter more, whether you honed your skills in a research lab, at work, or on your own at 2 a.m. If that sounds like you, let’s talk.
About the Role
This is a full-time remote role for a DevOps Lead – Bare-Metal & GPU Infrastructure (Linux). The successful candidate will be responsible for ensuring 99.999% service availability and optimum usage-scale infrastructure ratios while shipping code across hundreds of Linux GPU servers in multiple datacenter locations.
Responsibilities

Fleet reliability – design and automate HA architectures that tolerate node, rack, or site failure without user impact.

Ultrafast delivery – build zero-touch CI/CD pipelines (GitOps, progressive rollout, instant rollback) that push config or container changes globally in under 10 minutes.

Bare-metal lifecycle – PXE/Redfish/IPMI bootstrapping, firmware & driver orchestration, per-node GPU tuning, automated decommissioning.

Kubernetes on metal – multi-cluster control-plane HA, GPU scheduling, CNI overlay (Cilium/Calico), MetalLB/Ingress with <50 ms failover.

Observability at scale – end-to-end metrics, logs, traces, actionable SLO dashboards, and predictive auto-healing.

Incident command – primary on-call lead; run blameless postmortems and automate root-cause fixes.

Capacity bursts – script server bring-up (Ansible/Terraform/ClusterAPI) so 100+ new GPUs go live in minutes.

Security & compliance – kernel-level hardening, secrets management, GPU multi-tenancy isolation, continuous CVE patching.

Mentorship – guide a small SRE/DevOps pod, set coding standards, and champion best practices.

In Your First 12 Months You Will:

Cut average deployment latency to ≤ 2 min end-to-end, with one-click rollbacks.

Maintain ≤ 5 min total annual user-visible downtime (five nines) across all sites.

Automate server bring-up to <10 min from rack power-on to production workload.

Reduce P1 incidents by ≥ 60% through predictive alerting and auto-remediation.

Deliver fully auditable, Git-centric change pipelines adopted by 100% of engineering.
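
As a rough illustration of what a five-nines availability target implies, the sketch below converts an availability fraction into a yearly downtime budget. The function name `downtime_budget_minutes` and the 365.25-day average year are assumptions made for this example; they are not taken from Runware's tooling.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
# Assumption: an average year of 365.25 days, which includes leap days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {downtime_budget_minutes(target):.2f} min/year")
```

Under these assumptions, 99.999% availability allows about 5.26 minutes of downtime per year, so a goal of at most 5 minutes is slightly stricter than five nines.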

Requirements

5+ yrs Linux SRE/DevOps with 100+ bare-metal node fleets; 2+ yrs as technical lead.

Deep knowledge of NVIDIA/AMD GPU servers, high-speed interconnects (40 GbE+/InfiniBand/RoCE), NVMe/RDMA storage.

Proven record sustaining ≥ 99.999% uptime in latency-sensitive, high-variance demand environments.

Expert in Kubernetes on bare metal (ClusterAPI, KubeVirt, GPU Operator), advanced CNI, custom schedulers, and etcd care-and-feeding.

Strong skills in Go or Python, plus Bash; you write the tools you can’t find.

Infrastructure-as-Code mastery (Terraform, Ansible, Packer), GitOps workflows, and container build systems.

Monitoring/alerting stacks (Grafana), chaos/latency testing, synthetic probes.

Clear architectural thinking, crisp documentation, and calm communication under pressure.

Ready to architect zero-downtime, sub-minute rollouts for thousands of GPUs? Apply and let’s run the world’s AI together.
Benefits
We’re a remote-first collective, meeting in person twice a year to plan, brainstorm, celebrate wins, and enjoy some face-to-face time. We have core hours for cooperative working and calls, but outside of that your calendar is yours. Work the hours that let you perform at your peak while also building a healthy life.
Our release cycles are fast and intense, but they’re followed by real downtime. After big pushes we expect the team to unplug, recharge, and come back ready and stronger than ever for the next leap.

Generous paid time off – vacation, sick days, public holidays

Meaningful stock options – share in the upside you create

Remote-first setup – work from home anywhere we can employ you

Flexible hours – own your schedule outside core collaboration blocks

Family leave – paid maternity, paternity, and caregiver time

Company retreats – twice-yearly gatherings in inspiring locations

Required profile

Experience

Level of experience: Senior (5-10 years)

Spoken language(s):

English

Hard Skills

Kubernetes
Linux
Graphics Processing Unit (GPU)
Python (Programming Language)
Incident Management
NVIDIA CUDA
Grafana
Ansible
AMD Processor
Bash (Scripting Language)
Infrastructure as Code (IaC)
Continuous Monitoring
Go (Programming Language)
Terraform
Tealium
Performance Analysis
Chaos Engineering
Internal Documentation

Other Skills

Teamwork
Communication
Problem Solving
