You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can't fix, you drive resolution with partners.
San Francisco, CA (we are open to remote in the US for Senior and Staff levels)
