Operated production infrastructure at meaningful scale
Strong in practical DevOps execution and operational reliability
Experience with distributed systems or GPU-oriented workloads
Focus on automation, observability, and deployment safety
Requirements:
Improve CI/CD pipelines, deployment workflows, and release reliability
Standardize infrastructure and deployment patterns across environments
Improve observability through logging, metrics, tracing, and monitoring
Support ML-oriented infrastructure including SageMaker workloads and GPU scaling patterns
Job description
We're looking for a strong DevOps engineer who can help scale and operationalize our infrastructure as the platform grows. This is not a pure platform-architecture role; the focus is CI/CD, infrastructure automation, deployment reliability, observability, and GPU-oriented workload scaling.
What You'll Own
Improve CI/CD pipelines, deployment workflows, and release reliability
Standardize infrastructure and deployment patterns across environments
Improve observability through logging, metrics, tracing, dashboards, and rollout monitoring
Partner closely with backend engineering on:
deployment strategies
infrastructure automation
environment consistency
migration workflows
possible Kubernetes migration efforts
Support ML-oriented infrastructure as a secondary responsibility:
SageMaker workloads
Ray clusters
GPU scaling patterns
distributed batch execution
autoscaling behavior
runtime/image management
artifact delivery/versioning
The Kind of Problems You'll Work On
Deployment safety and rollback strategies
Infrastructure consistency across environments
Release automation and environment promotion flows
Autoscaling and runtime stability
GPU workload orchestration and scaling efficiency
Operational tooling that reduces friction for engineering teams
Stack
AWS
Terraform
Docker
Kubernetes
CI/CD systems
SageMaker
Ray
GPU compute infrastructure
You'll Probably Do Well Here If
You've operated production infrastructure at meaningful scale
You're strong in practical DevOps execution and operational reliability
You care about automation, observability, and deployment safety
You're comfortable improving developer workflows and infrastructure tooling
You've worked with distributed systems or GPU-oriented workloads before
From: Arizona (USA), District of Columbia (USA), Florida (USA), Illinois (USA), Kansas (USA), Missouri (USA), New York (USA), Oregon (USA), Washington (USA) (Full Remote)