Onsite
Software Developer 5, AI Infrastructure - Oracle
Software Engineer
Senior software engineer focused on designing and building ultra‑high‑performance GPU platforms for AI, ML, and HPC workloads, using RoCE, InfiniBand, and advanced monitoring to scale to thousands of GPUs with minimal latency.
About the role
Key Responsibilities
- Design and develop architectural changes for GPU delivery, health monitoring, and diagnostic services across large‑scale AI/ML/HPC deployments.
- Implement and optimize distributed systems using RoCE and InfiniBand to enable low‑latency, high‑throughput GPU clusters.
- Automate testing, triage, and performance analysis to ensure reliability and scalability of GPU workloads.
- Collaborate with cross‑functional teams to integrate new features into the OCI AI Infrastructure platform.
- Continuously evaluate emerging GPU technologies and propose enhancements to maintain a competitive edge.
Requirements
- 5+ years of software development experience in high‑performance computing or GPU‑centric environments.
- Strong proficiency in C++ and Python, and experience with low‑level networking technologies such as RoCE and InfiniBand.
- Hands‑on experience with distributed systems, performance profiling, and automated testing frameworks.
- Excellent problem‑solving skills and ability to work independently in a fast‑paced, consulting‑style environment.
- Effective communication skills and a collaborative mindset for cross‑team coordination.
Skills
Python, Go, Java, Linux, Agile