onsite
Senior Director of Software Development - Oracle
Engineering Manager
Lead the AI Infrastructure team to design, build, and operate high‑availability provisioning, monitoring, and validation systems for large‑scale GPU clusters, leveraging Python, C++, Linux, Kubernetes, and cloud services.
About the role
Key Responsibilities
- Define and execute the product roadmap for provisioning, repair, monitoring, and validation platforms that support massive GPU clusters.
- Architect and oversee the development of firmware management solutions for GPUs and high‑speed NICs.
- Drive high‑availability and performance engineering practices across the data‑plane and GPU specialization layers.
- Collaborate with hardware, cloud, and AI teams to ensure seamless integration and rapid time‑to‑market for AI workloads.
- Mentor senior engineers, foster a culture of technical excellence, and manage cross‑functional delivery milestones.
Requirements
- 10+ years of software engineering experience, with at least 5 years in a leadership role building large‑scale, high‑performance systems.
- Deep expertise in Python and C++ development on Linux platforms.
- Strong background in GPU computing, firmware development, and networking (high‑speed NICs).
- Hands‑on experience with container orchestration (Kubernetes) and cloud infrastructure (AWS or equivalent).
- Proven track record delivering highly available, low‑latency services in a distributed environment.
Skills
python, c, linux, kubernetes, aws