Research Engineer Performance Optimization

Work set-up: 
Full Remote
Contract: 
Salary: 
$180K - $250K yearly
Work from: 

Offer summary

Qualifications:

  • Strong problem-solving skills in PyTorch, CUDA, and distributed systems.
  • Experience training large models using Python and PyTorch, including data processing and inference.
  • Proficiency in profiling and optimizing CPU and GPU code, with familiarity with tools like Nvidia Nsight.
  • Knowledge of high-performance parallel C++, Triton, and writing custom PyTorch kernels.

Key responsibilities:

  • Implement efficient models and systems for data processing, training, and deployment.
  • Optimize and troubleshoot system performance bottlenecks in memory, speed, and utilization.
  • Collaborate with research teams to ensure system efficiency from start to finish.
  • Develop tools for dataset visualization, evaluation, and filtering.

Luma AI https://lumalabs.ai/dream-machine
11 - 50 Employees

Job description

We are looking for engineers with significant problem-solving experience in PyTorch, CUDA, and distributed systems. You will work with Research Scientists to build & train cutting-edge foundation models on thousands of GPUs.

Responsibilities

  • Ensure efficient implementation of models & systems for data processing, training, inference and deployment

  • Identify and implement optimization techniques for massively parallel and distributed systems

  • Identify and remedy efficiency bottlenecks (memory, speed, utilization) by profiling and implementing high-performance CUDA, Triton, C++ and PyTorch code

  • Work closely with the research team to ensure systems are planned to be as efficient as possible from start to finish

  • Build tools to visualize, evaluate and filter datasets

  • Implement cutting-edge product prototypes based on multimodal generative AI

Experience

  • Experience training large models using Python & PyTorch, including practical experience working with the entire development pipeline, from data processing, preparation & data loading to training and inference.

  • Experience optimizing and deploying inference workloads for throughput and latency across the stack (inputs, model inference, outputs, parallel processing, etc.)

  • Experience profiling CPU & GPU code in PyTorch, using Nvidia Nsight or similar tools.

  • Experience writing & improving highly parallel & distributed PyTorch code, with familiarity with DDP, FSDP, Tensor Parallel, etc.

  • Experience writing high-performance parallel C++. Bonus if done within an ML context with PyTorch, e.g. for data loading, data processing, or inference code.

  • Experience with high-performance Triton/CUDA and writing custom PyTorch kernels. Top candidates will be able to utilize tensor cores, optimize CUDA memory usage, and apply similar skills.

  • Good to have: experience with deep learning concepts such as Transformers & multimodal generative models such as Diffusion Models and GANs.

  • Good to have: experience building inference demo prototype code (incl. Gradio, Docker, etc.)


Compensation

The pay range for this position in California is $180,000 - $250,000/yr; however, base pay offered may vary depending on job-related knowledge, skills, candidate location, and experience. We also offer competitive equity packages in the form of stock options and a comprehensive benefits plan.

Your application is reviewed by real people.

Required profile

Experience

Spoken language(s):
English

Other Skills

  • Problem Solving
