Key Facts

Remote From:

California (USA)

Category: Machine Learning Engineer

Full time

Senior (5-10 years)

English

Hard Skills

Advanced Distributed Learning, Program Development, Message Transmission Optimization Mechanism, Concurrency Pattern, Performance Profiling, System Monitoring, Data Synchronization, Python (Programming Language), Distributed Computing, IP Routing, +8 more

Requirements:

Strong experience building and operating distributed systems in production
Hands-on expertise with distributed training frameworks (e.g., FSDP, DeepSpeed, Megatron)
Deep understanding of model parallelism (data, tensor, and pipeline parallelism)
Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture)

Roles & Responsibilities

Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques to minimize communication overhead.
Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs; build monitoring and metrics to track training progress, model quality, and system bottlenecks.
Architect resilient training systems with decentralized networking, including peer-to-peer topologies, NAT traversal, peer discovery, dynamic routing, and efficient communication despite participant churn.

Job description

Overview

Pluralis Research carries out foundational research on Protocol Learning: multi-participant training of foundation models where no single participant has, or can ever obtain, a full copy of the model. The purpose of Protocol Learning is to facilitate the creation of community-trained and community-owned frontier models with self-sustaining economics.

We're looking for Senior/Staff engineers with 5+ years of experience in distributed systems and large-scale ML training. You'll be implementing a novel substrate for training distributed ML models over consumer-grade internet connections.

Responsibilities

Distributed Training Architecture & Optimization

Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.

Decentralized Networking & Resilience

Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

What You’ll Bring

Strong experience building and operating distributed systems in production.
Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar).
Deep understanding of model parallelism (data, tensor, pipeline parallelism).
Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture).
Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination.
Experience optimizing GPU workloads, memory management, and large-scale compute efficiency.

What We Offer

Equity-heavy compensation with meaningful ownership in a mission-driven company
Competitive base salary for senior engineering roles in Australia
Visa sponsorship available for exceptional candidates
Remote-first with optional access to our Melbourne hub
World-class team: teammates were previously at Google, Amazon, Microsoft, and leading startups

Backed by Union Square Ventures and other tier-1 investors, we're a world-class, deeply technical team of ML researchers and engineers. Pluralis is unapologetically ideological. We view the world as a better place if we are able to implement what we are attempting, and Protocol Learning as the only plausible approach to preventing a handful of massive corporations monopolising model development, access and release, and achieving massive economic capture. If this resonates, please apply.





Machine Learning Engineer - Distributed ML Systems
