onsite
Senior HPC Support Engineer - Compute and GPU Platform - NVIDIA
Software Engineer
Senior engineer providing expert support for high‑performance computing and GPU‑accelerated platforms, troubleshooting Linux clusters, optimizing CUDA workloads, and collaborating with developers to deliver reliable AI and scientific computing solutions.
About the role
Key Responsibilities
- Provide tier‑2/3 technical support for Linux‑based HPC clusters and GPU‑accelerated systems, including installation, configuration, and performance tuning.
- Diagnose and resolve complex issues in CUDA applications, drivers, and runtime environments for AI, scientific, and rendering workloads.
- Maintain and optimize job schedulers (e.g., SLURM, PBS) and resource managers to ensure efficient utilization of compute and GPU resources.
- Develop and maintain automation scripts (Python, Bash) for monitoring, diagnostics, and routine maintenance tasks.
- Collaborate with hardware and software engineering teams to reproduce bugs, provide detailed logs, and drive root‑cause analysis.
- Document support procedures, best‑practice guides, and knowledge‑base articles for internal and external stakeholders.
Requirements
- 5+ years of hands‑on experience supporting Linux HPC clusters and GPU‑focused compute environments.
- Strong proficiency in CUDA development, driver stack, and performance profiling tools.
- Expertise with job schedulers (SLURM, PBS, LSF) and cluster management tools.
- Solid scripting skills in Python and Bash for automation and troubleshooting.
- Excellent problem‑solving communication skills and ability to work across distributed, cross‑functional teams.