
Machine Learning Engineer - ML Training Platform


Job description

Overview

Pluralis Research is pioneering Protocol Learning: a fully decentralized way to train and deploy AI models that opens this layer of the stack to individuals rather than well-resourced corporations. By pooling compute from many participants, incentivizing their efforts, and preventing any single party from controlling a model's full weights, we're creating a genuinely open, collaborative path to frontier-scale AI.

We’re looking for an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure powering our decentralized ML training platform. You will own core systems spanning infrastructure orchestration, distributed compute, and services integration, enabling continuous experimentation and large-scale model training.

Responsibilities

  • Multi-Cloud Infrastructure: Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform). Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.
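
    As a rough illustration of the "concurrent operations across hundreds of heterogeneous nodes" part (all names here are hypothetical stand-ins, not our actual code), bounding concurrent cloud-API calls with an asyncio semaphore might look like:

    ```python
    import asyncio

    MAX_CONCURRENT = 8  # cap on simultaneous cloud API calls (illustrative limit)

    async def provision_node(cloud: str, node_id: int) -> str:
        """Stand-in for a real cloud-SDK call (e.g. boto3 / google-cloud / azure-mgmt)."""
        await asyncio.sleep(0)  # placeholder for network I/O
        return f"{cloud}-node-{node_id}"

    async def provision_fleet(spec: dict[str, int]) -> list[str]:
        """Provision `count` nodes per cloud, bounding concurrency with a semaphore."""
        sem = asyncio.Semaphore(MAX_CONCURRENT)

        async def guarded(cloud: str, node_id: int) -> str:
            async with sem:
                return await provision_node(cloud, node_id)

        tasks = [guarded(cloud, i) for cloud, count in spec.items() for i in range(count)]
        return await asyncio.gather(*tasks)

    # fleet = asyncio.run(provision_fleet({"aws": 4, "gcp": 2, "azure": 2}))
    ```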

  • Distributed Training Systems: Architect fault-tolerant infrastructure for distributed ML: GPU clusters, NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.
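
    To give a flavor of "resilient retry strategies", here is a minimal sketch of retrying a flaky checkpoint upload with exponential backoff and jitter; `upload` is a hypothetical stand-in for a real S3 client call (e.g. boto3 `put_object`), and any exception is treated as transient for illustration:

    ```python
    import random
    import time

    def upload_with_retry(upload, payload, max_attempts=5, base_delay=1.0):
        """Retry a flaky upload, sleeping 1s, 2s, 4s, ... (plus jitter) between tries.

        Raises the last exception if all attempts fail.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return upload(payload)
            except Exception:
                if attempt == max_attempts:
                    raise
                # exponential backoff with jitter to avoid thundering herds
                time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))
    ```

    Production variants would also distinguish retryable from fatal errors and cap the total backoff.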

  • Real-World Networking: Build systems that simulate and handle real-world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity, because our training happens on consumer nodes and non-co-located infrastructure, not in a datacenter.
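
    As a toy model of the kind of condition a test harness might inject (real setups often shape traffic at the kernel level, e.g. with tc/netem; this sketch and its parameters are purely illustrative):

    ```python
    import random

    def deliver(messages, loss_rate=0.1, latency_ms=(20, 200), seed=None):
        """Simulate delivery over a lossy, variable-latency link.

        Returns (message, simulated_latency_ms) pairs; dropped messages
        are omitted entirely.
        """
        rng = random.Random(seed)
        out = []
        for msg in messages:
            if rng.random() < loss_rate:
                continue  # packet dropped
            out.append((msg, rng.uniform(*latency_ms)))
        return out
    ```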

What You’ll Bring

Ideally, you’ll have 5+ years of work experience, with depth in:

  • Infrastructure & Platform Engineering: Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments, lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale.

  • Distributed Systems & ML Infrastructure: Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, decentralized networking (P2P, NAT traversal, traffic shaping), and real-world bandwidth constraints.

  • Systems Programming & Reliability: Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling) with hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response.

What we’re looking for

  • Experience in a startup environment with an emphasis on microservices orchestration, or a big-tech background

  • Deep understanding of multi-cloud infra & distributed training systems

  • A team player with high attention to detail

  • A strong passion for the mission

Backed by Union Square Ventures and other tier-1 investors, we’re a world-class, deeply technical team of ML researchers. Pluralis is unapologetically ideological: we believe the world will be a better place if we succeed, and that Protocol Learning is the only plausible approach to preventing a handful of massive corporations from monopolizing model development, access, and release, and from achieving massive economic capture. If this resonates, please apply.
