Offer summary

Qualifications:

Minimum 8 years of experience as an infrastructure engineer or DevOps in large-scale distributed systems., Deep understanding of networking, with bonus points for HPC networking experience., Proficiency in developing high-quality software using general-purpose programming languages, preferably Python., Strong problem-solving skills, attention to detail, and experience with GPUs in large clusters..

Key responsibilities:

Manage and optimize training HPC clusters from provisioning to performance tuning.

Work closely with machine learning researchers to enable large-scale AI computations.

Implement observability, distributed job tracing, and GPU diagnostics tools.

Troubleshoot hardware and network failures in distributed systems.

Job description

Help Luma build some of the biggest & fastest AI supercomputing clusters in the world! As a HighPerformance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running largescale AI models. We work at the very cutting edge of speed and scale, combining the traditions of HighPerformance Computing (HPC) in a modern cloud environment.

For this role, it’s important you understand how to combine CPU’s, GPU’s, and network devices into systems that are then deployed at a large scale to peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how to best optimize the Linux kernel and userspace code. You are capable of writing code to automate the monitoring and healing of these systems, commanding a large number of servers with few people.

Responsibilities

In this role, you will work closely with and directly accelerate machine learning researchers, but dont need to be a machine learning expert yourself.
We value people who can quickly obtain a deep technical understanding of new domains and enjoy being selfdirected and identifying the most important problems to solve.
You’ll be managing training HPC clusters at Luma from provisioning to performance tuning.
Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management and additional tooling plus work on the actual code to enable necessary features.
We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.