onsite
Senior Software Engineer, GPU Baremetal
NVIDIA is looking for a Senior Software Engineer to join the GPU Baremetal team, responsible for designing and implementing software to manage the lifecycle of thousands of GPU servers in a distributed, baremetal environment. This role involves developing scalable and fault-tolerant services for provisioning, configuration, and monitoring GPU infrastructure, collaborating with cross-functional teams, and driving continuous improvement in system reliability.
About the team
NVIDIA is seeking a Senior Software Engineer to join the GPU Baremetal team. This team is at the forefront of building and operating GPU infrastructure for NVIDIA's internal AI/ML efforts. We build and manage thousands of GPU servers and are scaling rapidly.
We are looking for someone with a passion for designing and developing robust, scalable, and highly available services. You will be responsible for defining and implementing the software that manages the lifecycle of GPU servers.
What you'll be doing
- Design and implement software that manages the lifecycle of thousands of GPU servers in a distributed, baremetal environment.
- Develop and maintain scalable, highly available, and fault-tolerant services that orchestrate provisioning, configuration, and monitoring of GPU infrastructure.
- Work closely with internal customers to understand their requirements and translate them into technical solutions.
- Collaborate with cross-functional teams to ensure seamless integration of our services with other NVIDIA platforms.
- Drive continuous improvement in system reliability, performance, and efficiency through monitoring, alerting, and automation.
- Mentor junior engineers and contribute to a culture of technical excellence.
What we need to see
- B.S., M.S., or Ph.D. in Computer Science or a related field (or equivalent experience).
- 5+ years of experience in software development, with a focus on distributed systems and baremetal infrastructure.
- Proficiency in C++ and Python.
- Strong understanding of Linux operating systems and networking concepts.
- Experience building and operating large-scale, highly available services.
- Excellent problem-solving, debugging, and communication skills.
Ways to stand out from the crowd
- Experience with GPU infrastructure and High-Performance Computing (HPC) environments.
- Familiarity with data center networking and automation.
- Prior experience with REST APIs, gRPC, and service mesh technologies.
- Proven track record of delivering high-quality software in a fast-paced, agile environment.