onsite

Principal Software Engineer - At Scale Reliability & Fleet Intelligence - NVIDIA

Software Engineer

Lead fleet‑scale reliability initiatives for hyperscale CSP customers, driving architecture, telemetry integration, and continuous improvement of NVIDIA platforms using C++, Python, and distributed systems expertise.

About the role

Key Responsibilities

Serve as the technical lead for fleet‑scale reliability across CSP and hyperscale engagements, ensuring NVIDIA platforms meet MTBI targets.
Collaborate with customer engineering teams to design and validate reliability software/firmware architecture and methodology.
Integrate telemetry and failure data from customer fleets into internal improvement pipelines, prioritizing fixes and enhancements.
Develop and maintain automated test suites, CI/CD pipelines, and monitoring dashboards to track reliability metrics.
Provide mentorship and guidance to cross‑functional teams on best practices for reliability engineering and distributed system design.

Requirements

10+ years of software engineering experience, with deep expertise in C++ and Python.
Proven track record in reliability engineering for large‑scale, distributed systems.
Strong understanding of telemetry collection, data analysis, and failure mode investigation.
Experience with CI/CD tooling, containerization, and cloud platforms (AWS preferred).
Excellent communication skills and ability to influence technical direction across multiple teams.

Skills

C++, Python, CI/CD, AWS

Company: NVIDIA

Department: Engineering

Location: Santa Clara, California, United States

Experience: 7+ years

Tenure: full-time

Level: Lead

Salary: 431,250

Posted June 27, 2026