remote
Senior Systems Software Engineer, Accelerated Kubernetes Performance and Scale - NVIDIA
Software Engineer
Senior engineering role building high‑performance, scalable Kubernetes infrastructure for AI workloads, using C++, Linux, and GPU programming to improve cluster efficiency and accelerate DGX Cloud services.
About the role
Key Responsibilities
- Design and implement performance‑critical components for Kubernetes that enable massive AI workloads on GPU‑accelerated clusters.
- Develop, profile, and tune C++ and CUDA code paths to maximize throughput and reduce latency across distributed systems.
- Collaborate with hardware, driver, and cloud teams to integrate GPU resources seamlessly into container orchestration pipelines.
- Build tooling and automation for benchmarking, monitoring, and scaling Kubernetes clusters in production environments.
- Contribute to open‑source and internal projects, provide technical leadership, and mentor junior engineers.
Requirements
- 5+ years of systems software development experience, primarily in C++ on Linux.
- Deep expertise with Kubernetes architecture, custom controllers, and scheduler extensions.
- Strong background in GPU programming (CUDA, cuDNN) and performance optimization for AI/ML workloads.
- Proven ability to profile, debug, and improve large‑scale distributed systems.
- Excellent problem‑solving skills and the ability to work cross‑functionally in a fast‑paced environment.