onsite
Principal Software Engineer - Rack Scale System Software - NVIDIA
Software Engineer
Lead the design and implementation of rack‑scale system software and firmware, ensuring reliable deployment, monitoring, and recovery of GPU/NVSwitch infrastructure across fleet‑scale cloud service provider (CSP) environments.
About the role
Key Responsibilities
- Architect and develop system‑level software that manages, monitors, and recovers entire rack‑scale deployments, including GPU and NVSwitch components.
- Design and implement telemetry APIs and health monitoring frameworks to provide real‑time status and diagnostics to CSP engineering teams.
- Orchestrate firmware updates and rollouts across heterogeneous hardware, with zero downtime and reliable rollback.
- Collaborate with cross‑functional teams to integrate error handling, recovery, and serviceability features into the rack‑scale stack.
- Provide technical leadership for CSP engagements, translating customer requirements into robust, scalable software solutions.
Requirements
- Extensive experience in C/C++ and Linux kernel development for embedded and high‑performance systems.
- Deep knowledge of GPU architecture, NVSwitch technology, and associated firmware ecosystems.
- Proven track record designing telemetry, monitoring, and orchestration solutions for large‑scale deployments.
- Strong problem‑solving skills with a focus on reliability, fault tolerance, and serviceability.
- Excellent communication and collaboration abilities in a fast‑paced, cross‑functional environment.