onsite
Senior Software Development Engineer - Collectives and Network - AMD
Software Engineer
Senior engineer designing and implementing high‑performance collective communication and networking software for next‑generation AI and data‑center GPUs, leveraging C++, Python, Linux, and GPU programming expertise.
About the role
Key Responsibilities
- Design and develop scalable collective communication libraries and network stack components for GPU‑accelerated platforms.
- Implement high‑performance algorithms for data movement, synchronization, and fault tolerance across multi‑node systems.
- Collaborate with hardware architects and driver teams to integrate software solutions with new GPU architectures.
- Profile, benchmark, and optimize code paths to meet stringent latency and throughput targets.
- Maintain and evolve CI/CD pipelines, testing frameworks, and documentation for the collective and networking stack.
Requirements
- 5+ years of professional software development experience in C++ and Python on Linux.
- Deep understanding of GPU programming models (e.g., CUDA, HIP on ROCm) and low‑level networking protocols.
- Proven experience with distributed systems, high‑performance computing, or AI/ML workloads.
- Strong debugging, profiling, and performance‑tuning skills using tools such as perf, gdb, and GPU profilers.
- Excellent communication and teamwork abilities, with a track record of delivering complex software in cross‑functional environments.