onsite

Senior Reliability Engineer, DGX Cloud - NVIDIA

Software Engineer

Lead reliability and uptime for NVIDIA’s DGX Cloud platform, ensuring high availability of GPU‑accelerated AI services through advanced monitoring, incident response, and automation across Kubernetes and cloud environments.

About the role

Key Responsibilities

Design, implement, and maintain reliability solutions for DGX Cloud, ensuring 99.99% uptime for GPU‑accelerated AI workloads.
Develop and refine monitoring, alerting, and incident response workflows using Prometheus, Grafana, and PagerDuty.
Collaborate with platform, security, and DevOps teams to automate capacity planning, scaling, and roll‑outs via Kubernetes and CI/CD pipelines.
Analyze post‑incident reports, identify root causes, and implement preventive measures to reduce MTTR and increase MTBF.
Drive continuous improvement of observability, performance testing, and chaos engineering practices.

Requirements

5+ years of experience in reliability or SRE roles for large‑scale cloud or GPU‑based services.
Proficiency with Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
Strong scripting skills (Python, Bash) and familiarity with infrastructure as code (Terraform, Helm).
Hands‑on experience with monitoring/alerting stacks (Prometheus, Grafana, Loki) and incident management tools.
Excellent problem‑solving, communication, and collaboration skills in a fast‑paced, cross‑functional environment.

Skills

Kubernetes

Company: NVIDIA

Department: Engineering

Location: Santa Clara, California, United States

Experience: 5+ years

Tenure: full-time

Level: Senior

Salary: 333,500

Posted June 23, 2026