onsite
Principal Software Engineer - At Scale Reliability & Fleet Intelligence - NVIDIA
Software Engineer
Lead fleet‑scale reliability initiatives for hyperscale CSP customers, driving architecture, telemetry integration, and continuous improvement of NVIDIA platforms. The role draws on C++, Python, and distributed systems expertise.
About the role
Key Responsibilities
- Serve as the technical lead for fleet‑scale reliability across CSP and hyperscale engagements, ensuring NVIDIA platforms meet mean time between interruptions (MTBI) targets.
- Collaborate with customer engineering teams to design and validate the architecture and methodology for reliability software and firmware.
- Integrate telemetry and failure data from customer fleets into internal improvement pipelines, and use that data to prioritize fixes and enhancements.
- Develop and maintain automated test suites, CI/CD pipelines, and monitoring dashboards to track reliability metrics.
- Mentor and guide cross‑functional teams on best practices for reliability engineering and distributed system design.
Requirements
- 10+ years of software engineering experience, with deep expertise in C++ and Python.
- Proven track record in reliability engineering for large‑scale, distributed systems.
- Strong understanding of telemetry collection, data analysis, and failure mode investigation.
- Experience with CI/CD tooling, containerization, and cloud platforms (AWS preferred).
- Excellent communication skills and ability to influence technical direction across multiple teams.