onsite

Senior HPC Support Engineer - Compute and GPU Platform - NVIDIA

Software Engineer

Senior engineer providing expert support for high‑performance computing and GPU‑accelerated platforms, troubleshooting Linux clusters, optimizing CUDA workloads, and collaborating with developers to deliver reliable AI and scientific computing solutions.

About the role

Key Responsibilities

Provide tier‑2/3 technical support for Linux‑based HPC clusters and GPU‑accelerated systems, including installation, configuration, and performance tuning.
Diagnose and resolve complex issues in CUDA applications, drivers, and runtime environments for AI, scientific, and rendering workloads.
Maintain and optimize job schedulers (e.g., SLURM, PBS) and resource managers to ensure efficient utilization of compute and GPU resources.
Develop and maintain automation scripts (Python, Bash) for monitoring, diagnostics, and routine maintenance tasks.
Collaborate with hardware and software engineering teams to reproduce bugs, provide detailed logs, and drive root‑cause analysis.
Document support procedures, best‑practice guides, and knowledge‑base articles for internal and external stakeholders.

Requirements

5+ years of hands‑on experience supporting Linux HPC clusters and GPU‑focused compute environments.
Strong proficiency in CUDA development, the driver stack, and performance profiling tools.
Expertise with job schedulers (SLURM, PBS, LSF) and cluster management tools.
Solid scripting skills in Python and Bash for automation and troubleshooting.
Excellent problem‑solving and communication skills, and the ability to work across distributed, cross‑functional teams.

Skills

Linux, CUDA, Python

Company: NVIDIA

Department: Support

Location: Santa Clara, California, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Salary: 207,000

Posted June 25, 2026