onsite

Director, Engineering Operations and Site Reliability Engineering - Datacenter Server Systems - NVIDIA

Systems Engineer

Lead engineering operations for NVIDIA’s datacenter server systems, driving reliability, scalability, and automation across GPU infrastructure using SRE principles, cloud technologies, and Python scripting.

About the role

Key Responsibilities

Lead a cross‑functional team to design, build, and operate high‑availability datacenter server systems for GPU workloads.
Implement SRE practices, including incident response, capacity planning, and reliability metrics, to ensure 99.99% uptime.
Drive automation of deployment, monitoring, and scaling using Python, CI/CD pipelines, and container orchestration (Kubernetes).
Collaborate with hardware, firmware, and software teams to optimize performance and power efficiency of GPU server platforms.
Establish and enforce best practices for security, compliance, and disaster recovery across the datacenter fleet.

Requirements

10+ years of experience in large‑scale datacenter operations or SRE roles, with a focus on GPU or high‑performance computing environments.
Deep knowledge of cloud infrastructure, virtualization, and container orchestration (Kubernetes, Docker).
Proficiency in Python for automation, scripting, and tooling development.
Strong leadership skills, with the ability to mentor and grow a high‑performing engineering team.
Excellent communication and problem‑solving abilities in a fast‑paced, innovative setting.

Skills

python

Company: NVIDIA

Department: Operations

Location: Santa Clara, California, United States

Experience: 3+ years

Tenure: full-time

Level: Mid-Level

Salary: 442,750

Posted June 27, 2026