Senior Solutions Architect - AI Factory Deployment

Requirements

  • Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.
  • 6+ years of experience managing Linux-based HPC/distributed systems or AI/ML environments, with hands-on experience running workloads on multi-GPU/multi-node clusters.
  • Deep knowledge of collective communication patterns (AllReduce and AllToAll) and practical experience with NCCL in ML/LLM training workflows (PyTorch or TensorFlow).
  • Proficiency in Python and Shell/Bash scripting, plus experience with benchmarking and observability tooling (metrics, logs, dashboards) to automate and monitor performance.

Roles & Responsibilities

  • Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters, ensuring that NCCL, collectives, and distributed training framework configurations align with guidelines.
  • Own execution of key AI/LLM benchmarks: setup, orchestration, result collection, analysis, and issue resolution for underperforming or failed jobs.
  • Build and improve observability for AI factories, including metrics, logs, traces, and dashboards; develop automation for benchmarking and regression checks.
  • Collaborate across hardware, software, networking, data center, and product teams to prepare AI factories for customer use; contribute to documentation and readiness collateral.

Job description

We are seeking an ambitious Senior Solutions Architect - AI Factory Deployment to join our NVIDIA Infrastructure Specialists team in Santa Clara! This role is uniquely positioned to develop, deploy, and validate AI factories end to end. You will focus on running and debugging AI/LLM workloads and benchmarks on Linux-based GPU clusters, using NCCL and collectives like AllReduce and AllToAll to improve performance and scalability.

As part of our world-class team, you will bring observability and automation to bear on benchmarking and validation. You will serve as the expert when workloads or benchmarks fail, hang, or underperform. You will collaborate across NVIDIA to ensure AI factories are prepared for customers, validating hardware and software for modern AI deployments.

What You Will Be Doing:

  • Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.

  • Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.

  • Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.

  • Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.

  • Build and improve observability for AI factories (metrics, logs, traces, dashboards) to understand workload behavior and system health.

  • Develop automation (Python, Shell) for running benchmarks, collecting results, and performing regression checks.

  • Examine communication patterns and NCCL usage for AI/LLM workloads, concentrating on collectives such as AllReduce and AllToAll.

  • Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.

  • Work closely with hardware, software, networking, data center, and product teams to prepare AI factories for customer use.

  • Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

What We Need to See:

  • Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or related field.

  • 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML settings.

  • Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.

  • Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and how they are applied in contemporary ML/LLM training.

  • Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.

  • Proficiency with Python and Shell/Bash for scripting, automation, and tooling.

  • Experience with benchmarking (crafting, executing, and interpreting performance benchmarks).

  • Comfortable working with observability data (metrics, logs, dashboards) to troubleshoot and optimize complex distributed workloads.

  • Strong communication skills and the ability to work effectively with cross-functional teams.

Ways to Stand Out From the Crowd:

  • Experience with AI factory or large-scale AI infrastructure build, deployment, or operations.

  • Background in HPC performance engineering, SRE, or systems performance analysis for GPU-accelerated environments.

  • Familiarity with observability stacks (e.g., metrics/monitoring, logging, tracing systems) used for large distributed systems.

  • Experience building automation and CI-style pipelines for running and validating benchmarks at scale.

  • Demonstrated desire to use AI to solve practical problems, improve workflows, and guide data-driven decisions.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 3, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
