NVIDIA Solutions Architect Intern - CSP&CRISP - 2025

Remote: Full Remote

Offer summary

Qualifications:

  • Pursuing an MS or PhD in Computer Science, Artificial Intelligence, Electrical Engineering, or a related field.
  • Hands-on experience with large language model (LLM) training and/or inference frameworks.
  • Strong proficiency in PyTorch and Python programming.
  • Solid understanding of Transformer architectures, model quantization techniques, and distributed training paradigms.

Key responsibilities:

  • Drive LLM efficiency by designing and leveraging low-precision quantization techniques for customer deployments.
  • Simulate, optimize, and extend training and inference frameworks within NVIDIA's ecosystem.
  • Integrate and validate next-generation LLM architectures and features within core frameworks.
  • Collaborate with engineering teams to translate customer requirements into high-impact LLM engineering implementations.

NVIDIA (http://www.nvidia.com), 10001 Employees

Job description

At NVIDIA, we're pioneering transformative AI technologies that reshape industries globally. Join our dynamic, forward-thinking team working directly at the bleeding edge of Large Language Models (LLMs). This internship offers a unique opportunity to contribute to customer-accelerated deployments where your engineering skills directly impact leading enterprises. You'll collaborate closely with senior engineers and hardware experts to push the boundaries of efficiency and capability in LLM training and inference. Gain unparalleled experience applying your knowledge to real-world challenges on NVIDIA's industry-defining Tensor Core GPUs and full-stack AI platforms. Stand shoulder-to-shoulder with world-class researchers and engineers solving the next generation of challenges in AI scale and speed.

What you’ll be doing

  • Drive LLM efficiency: Design and leverage advanced low-precision quantization techniques (INT8, FP8, FP4) to optimize inference performance for customer deployments.

  • Innovate with frameworks: Simulate, optimize, and extend cutting-edge training & inference frameworks (e.g., vLLM, SGLang, TensorRT-LLM, NeMo, Megatron) within NVIDIA's ecosystem.

  • Enable new AI capabilities: Integrate and validate next-generation LLM architectures and features within core frameworks to expand NVIDIA's solution offerings.

  • Tune for peak performance: Conduct rigorous performance analysis and tuning of LLM workloads for optimal execution on cloud and on-premises NVIDIA platforms.

  • Collaborate on customer solutions: Partner with engineering teams and solution architects to translate customer requirements into high-impact LLM engineering implementations.
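For candidates new to the quantization work described above, the core idea can be sketched in a few lines of plain Python: symmetric per-tensor INT8 quantization maps floats onto integer codes via a single scale factor. This is a toy illustration of the underlying arithmetic, not NVIDIA's actual tooling (production paths such as TensorRT-LLM add calibration, per-channel scales, and fused kernels):

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: map floats onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from INT8 codes."""
    return [qi * scale for qi in q]

# Toy "weights"; round-trip error is bounded by scale / 2 per element.
weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q)  # [50, -127, 3, 100]
print(max(abs(w - r) for w, r in zip(weights, recovered)))
```

The same scale-and-round core underlies FP8/FP4 formats as well; what changes is the code grid (floating-point rather than integer) and how scales are chosen and stored.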

What we need to see

  • Pursuing an MS or PhD in Computer Science, Artificial Intelligence, Electrical Engineering, or a related field.

  • Hands-on experience with large language model (LLM) training and/or inference frameworks from project work, research, or prior internships.

  • Strong proficiency in PyTorch and Python programming.

  • Solid foundational understanding of:

    • Transformer architectures & core LLM algorithms.

    • Principles and trade-offs of model quantization techniques.

    • Distributed training paradigms (e.g., FSDP, ZeRO, 3D/5D parallelism, RLHF infrastructure).

  • A link to your GitHub profile or code samples demonstrating relevant projects is required with your application.

Ways to stand out from the crowd

  • Demonstrable experience with quantization tools and workflows (e.g., GPTQ, AWQ, SmoothQuant).

  • Contributions to relevant Open Source Software projects (e.g., vLLM, SGLang, Hugging Face Transformers, PyTorch, DeepSpeed).

  • Understanding of GPU architecture (CUDA), high-performance computing concepts, and cluster communication libraries (e.g., NCCL, MPI).

  • Record of published research in machine learning, NLP, or systems at major conferences/journals.

  • Experience deploying or optimizing workloads on NVIDIA GPUs and familiarity with NVIDIA AI software stacks.

Required profile

Experience

Spoken language(s):
English

Other Skills

  • Collaboration
  • Problem Solving
