Onsite
Director, Engineering Operations and Site Reliability Engineering - Datacenter Server Systems - NVIDIA
Systems Engineer
Lead engineering operations for NVIDIA’s datacenter server systems, driving reliability, scalability, and automation across GPU infrastructure using SRE principles, cloud technologies, and Python scripting.
About the role
Key Responsibilities
- Lead a cross‑functional team to design, build, and operate high‑availability datacenter server systems for GPU workloads.
- Implement SRE practices, including incident response, capacity planning, and reliability metrics, to sustain 99.99% uptime.
- Drive automation of deployment, monitoring, and scaling using Python, CI/CD pipelines, and container orchestration (Kubernetes).
- Collaborate with hardware, firmware, and software teams to optimize performance and power efficiency of GPU server platforms.
- Establish and enforce best practices for security, compliance, and disaster recovery across the datacenter fleet.
Requirements
- 10+ years of experience in large‑scale datacenter operations or SRE roles, with a focus on GPU or high‑performance computing environments.
- Deep knowledge of cloud infrastructure, virtualization, and container orchestration (Kubernetes, Docker).
- Proficiency in Python for automation, scripting, and tooling development.
- Strong leadership skills, with the ability to mentor and grow a high‑performing engineering team.
- Excellent communication and problem‑solving abilities in a fast‑paced, innovative setting.