You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can't fix, you drive resolution with partners.
San Francisco, CA (we are open to remote in the US for Senior and Staff levels)

Sezzle

Grafana Labs

FOSSA

PolicyMe

Broadsign

fal