onsite
Senior Reliability Engineer, DGX Cloud - NVIDIA
Software Engineer
Lead reliability and uptime for NVIDIA’s DGX Cloud platform, ensuring high availability of GPU‑accelerated AI services through advanced monitoring, incident response, and automation across Kubernetes and cloud environments.
About the role
Key Responsibilities
- Design, implement, and maintain reliability solutions for DGX Cloud, ensuring 99.99% uptime for GPU‑accelerated AI workloads.
- Develop and refine monitoring, alerting, and incident response workflows using Prometheus, Grafana, and PagerDuty.
- Collaborate with platform, security, and DevOps teams to automate capacity planning, scaling, and roll‑outs via Kubernetes and CI/CD pipelines.
- Analyze post‑incident reports, identify root causes, and implement preventive measures to reduce MTTR and increase MTBF.
- Drive continuous improvement of observability, performance testing, and chaos engineering practices.
Requirements
- 5+ years of experience in reliability or SRE roles for large‑scale cloud or GPU‑based services.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
- Strong scripting skills (Python, Bash) and familiarity with infrastructure as code (Terraform, Helm).
- Hands‑on experience with monitoring/alerting stacks (Prometheus, Grafana, Loki) and incident management tools.
- Excellent problem‑solving, communication, and collaboration skills in a fast‑paced, cross‑functional environment.