Key Facts

Remote From:

Full time

Senior (5-10 years)

English

Hard Skills

• Infrastructure as Code (IaC)
• Kubernetes
• Distributed Computing
• High Performance Computing
• Observability
• Site Reliability Engineering
• Policy Enforcement
• Test Harness
• Kernel Debuggers
• Identity And Access Management

Other Skills

• Communication
• Team Oriented
• Growth Mindedness

Requirements

10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Platform Engineering
Deep experience designing and operating distributed systems at scale on cloud platforms (e.g., AWS) and Kubernetes
Strong expertise in reliability engineering practices, incident management, fault isolation, resiliency design, and performance tuning
Experience building and operating CI/CD systems, test harnesses, and automated validation frameworks

Roles & Responsibilities

Architect and evolve core platform capabilities for reliability, including execution environments, CI/CD systems, and validation pipelines for high-throughput, machine-assisted change
Design and implement fast, ephemeral, strictly isolated execution environments for building, testing, and safely discarding generated work at scale
Transform CI/CD into a validation system by embedding automated verification (tests, integration harnesses, canarying, rollback signals) into promotion decisions
Build production-like validation environments that allow realistic system behavior testing without impacting live systems

Job description

Join ABC Fitness, the leading technology provider for the fitness industry!

What You’ll Do

• Architect and evolve core platform capabilities for reliability, including execution environments, CI/CD systems, and validation pipelines that support high-throughput, machine-assisted change.

• Design and implement fast, ephemeral, and strictly isolated execution environments where generated work can be built, tested, and safely discarded at scale.

• Transform CI/CD into a validation system by embedding automated verification (tests, integration harnesses, canarying, rollback signals) into promotion decisions.

• Build production-like validation environments that allow realistic system behavior testing without impacting live systems.

• Establish deep observability patterns for autonomous workflows, including tracing what ran, what failed, why, and what it cost across agents, tools, and orchestration layers.

• Define and implement guardrails-as-code, including access controls, policy enforcement, cost protections, and auditability for platform usage.

• Design for reliability from day one, including scalability, fault tolerance, performance optimization, and operational resilience.

• Lead technical design reviews and influence platform and infrastructure decisions across engineering teams.

• Define and document reusable infrastructure patterns, platform standards, and reference implementations that create a consistent paved path for teams.

What This Is Not

• Not a ticket queue or generic support role.

• Not incremental-only ops without ownership of architecture and adoption.

• Not “just Kubernetes admin”—Kubernetes is one layer in a broader platform problem.

What You’ll Need

• Typically 10+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or Platform Engineering.

• Deep experience designing and operating distributed systems at scale, including cloud platforms (e.g., AWS), Kubernetes, and infrastructure-as-code.

• Strong expertise in reliability engineering practices, including incident management, fault isolation, resiliency design, and system performance tuning.

• Experience building and operating CI/CD systems, test harnesses, and automated validation frameworks.

• Strong understanding of observability systems, including metrics, logging, tracing, and system-level debugging.

• Demonstrated ability to define technical standards and influence multiple teams through architecture, design review, and strong engineering judgment.

• Strong production mindset, with experience designing systems for scalability, availability, and operational efficiency.

• Experience implementing secure, multi-tenant infrastructure with strong isolation, IAM, and secrets management practices.

• Excellent cross-functional collaboration skills.

• Growth mindset and One Team orientation.

And It’s Great to Have

• Experience supporting AI/LLM-powered systems in production, including an understanding of their latency, cost, and orchestration challenges.

• Experience designing high-throughput ephemeral compute systems or sandboxed execution environments.

• Experience building internal developer platforms or platform-as-a-product capabilities.

• Familiarity with governance or regulated environments.

• Experience with advanced validation systems such as canarying, chaos engineering, or automated rollback strategies.

What Success Looks Like

• Faster delivery through platform-enabled validation and automation.

• Automated validation of changes before production, reducing reliance on manual review.

• Platform standards adopted across teams as the default paved path.

• Early detection of reliability issues through strong observability and validation systems.

• Reduced infrastructure complexity so engineers can focus on product and policy.

Why This Matters

ABC Fitness is evolving toward an AI-native engineering model where automation, agents, and platform systems handle increasing portions of the software lifecycle. This role builds the foundation that enables scalable, trustworthy, and high-velocity software delivery across the organization.

If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!
