Key Facts

Remote From:

Full time

English

Hard Skills

Convex Optimization PyTorch (Machine Learning Library) Advanced Distributed Learning Performance Systems Analysis Parallel Computing Pipeline Pigging Engineering Optimization Thread Pool Pattern Performance Management System Optimization Training Analysis Profiling (Computer Programming) Bottleneck Analysis Fault Tolerance Batch Files GPU Optimization DeepSpeech Property Ownership Quality Assessment

Other Skills

• Training And Development
• Collaboration

Job description

About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost—while collaborating closely with researchers pushing model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do

Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
Improve distributed training strategies (data, model, and pipeline parallelism)
Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
Collaborate with researchers on architecture-aware training strategies
Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
Own training performance metrics and continuously push them forward

What We’re Looking For

Strong experience training large neural networks (LLMs or similarly large models)
Hands-on experience with training optimization (not just model usage)
Solid understanding of:
- Backpropagation, optimization algorithms, and training dynamics
- Distributed systems for ML training
Experience with PyTorch (required)
Comfort working close to hardware (GPUs, memory, networking constraints)
Ability to move fluidly between research ideas and production-ready code

Nice to Have

Experience with large-scale distributed training (multi-node, multi-GPU)
Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
Experience optimizing training on AMD or NVIDIA GPUs
Contributions to open-source ML infrastructure or research codebases
Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us

Real ownership at Series-A stage — your work shapes the company’s trajectory
Work on cutting-edge models and training systems at scale
Small, highly technical team with fast feedback loops
Strong emphasis on engineering quality and research rigor
Competitive compensation + meaningful equity

Machine Learning Engineer — Training Optimization
