Key Facts

Remote From:

Full time

Senior (5-10 years)

English

Hard Skills

• Large Language Modeling
• Advanced Distributed Learning
• Software Engineering
• Distributed Computing
• Developing Training Materials
• Bottleneck Analysis
• Model Validation
• Infrastructure Management
• Kernel Debuggers
• Data Curation
• High Performance Computing
• Engineering Research
• Technical Leadership

Other Skills

• Training And Development
• Adaptability
• Teamwork
• Quick Learning
• Problem Solving

Requirements:

Proven track record in ML engineering with experience in training, post-training, or deployment of large language models; post-training expertise is preferred.
Strong experience with distributed ML training and scaling on GPU clusters (NVIDIA/AMD) and in multi-node environments, with the ability to diagnose GPU/kernel issues and memory/storage bottlenecks.
Proficiency in building reliable, scalable software stacks for large-scale LLM development, including tooling, monitoring, and observability.
Demonstrated technical leadership, problem-solving, collaboration, and engineering discipline; ability to adapt across HPC configurations and architectures.

Roles & Responsibilities

Design, implement and optimize core components across data curation, evals, pre-training, and post-training for frontier model development.
Build, maintain and scale the training software stack for SOTA LLMs, including infrastructure, tooling, monitoring and observability, and multi-node GPU training.
Diagnose and resolve GPU/kernel issues, memory/storage bottlenecks and training instabilities; adapt to different HPC configurations and GPU architectures (AMD/NVIDIA).
Release models as open source and integrate them into Flower Labs products, while providing technical leadership and collaborating in a fast-paced startup environment.

Job description

Do you want to push the boundaries of what frontier AI models can be? Join as one of the founding members of the Flower Frontier Model Team, a new group at Flower Labs charged with building category-defining models that blend the bleeding edge of existing practice with Flower’s pioneering decentralized learning methods. This is a fundamentally different direction from the one conventional frontier labs are taking, one that not only eases the path to GPU scaling but also unlocks data silos that currently cannot be used for frontier model training.

We will ship models with superhuman capabilities in domains spanning science, health, finance, drug discovery, and more. This is an opportunity to help invent and build the training paradigms that will define the next decade of AI, and to work on technologies that others will study, emulate, and build upon.

About the Role

(Preference is given to candidates with post-training expertise, but any talented individual with a track record of exceptional drive and determination is encouraged to apply regardless of prior experience.)

As a founding ML Engineer in this new team, you will play a critical role in building SOTA LLMs and foundation models within a small, high-impact team of contributors with a mix of research and engineering backgrounds. This role combines fast-paced development with disciplined software engineering: you will help build a reliable, maintainable and scalable software stack and use it to produce world-leading models that are open-sourced and integrated into new Flower Labs products.

You will design, implement and optimize core components across the full spectrum of stages relevant to frontier model building: data curation, evals, pre-training, and post-training. Everything is in scope as the team seeks to release its first series of models. Experience in these areas is welcome, but problem solving, learning on the job, and working collaboratively to combine the team's talents efficiently are explicit requirements for success. Familiarity with distributed ML training and scaling strategies will be essential, as will experience working with GPU clusters (or similar) for multi-node training. You will diagnose and resolve GPU/kernel issues, memory/storage bottlenecks, and multi-node failures at scale, and collaborate on debugging training instabilities and related issues. The ability to adapt to different HPC configurations and GPU architectures (e.g., AMD/NVIDIA) will be a big plus. You will also devise the surrounding infrastructure, tooling, monitoring, and observability, all of which are essential for large-scale LLM development.

This is a foundational role for an ambitious technical effort. We are looking for a special talent who brings strong engineering discipline to the team and can assume technical leadership as the training system scales in complexity and capability. More broadly, you can expect a collaborative, fast-paced and demanding startup environment with a team of experts in their respective fields, in which everyone still learns something new every day. You will have the opportunity to contribute ideas, be heard and influence the direction of the company across the board.

About the Company

Flower Labs is the AI startup best known for being behind the most popular open-source framework in the world for training AI on distributed data and compute resources using decentralized and federated methods. Industry leaders such as Mozilla, JP Morgan, Owkin, Banking Circle and Temenos use Flower to improve their AI models on sensitive data that is distributed across organizational silos or user devices. In a world where most AI relies on centralized public datasets, which are just a fraction of the data available, we believe unlocking access to sensitive data, which is orders of magnitude larger in volume, will drive the next breakthroughs in artificial intelligence.

Flower Labs is a Y Combinator (YCW23) graduate and is backed by top-tier investors and renowned angels, including Felicis, First Spark Ventures, Mozilla Ventures, Hugging Face CEO Clem Delangue, GitHub Co-Founder Scott Chacon, Factorial Capital, Betaworks, and Pioneer Fund. Together, we are redefining how AI is built, deployed, and scaled.