onsite
Software Development Engineer - Collectives and Network - AMD
Software Engineer
Develop and optimize collective communication and networking software for high‑performance GPUs, using C++ and Python on Linux to deliver low‑latency, scalable solutions for AI and data‑center workloads.
About the role
Key Responsibilities
- Design, implement, and maintain collective communication libraries and networking stacks for GPU‑accelerated platforms.
- Collaborate with hardware and firmware teams to integrate software with next‑generation GPU architectures.
- Profile and optimize performance to achieve low latency and high throughput across multi‑node systems.
- Develop test frameworks and validation suites to ensure reliability and correctness of communication primitives.
- Contribute to open‑source and internal tooling for debugging, monitoring, and performance analysis.
Requirements
- Strong proficiency in C++ (C++14/17) and Python for system‑level development.
- Experience with Linux kernel or user‑space networking, including TCP/UDP, RDMA, or similar high‑performance communication protocols.
- Solid understanding of GPU architectures and parallel programming models (e.g., CUDA, HIP).
- Demonstrated ability to profile, debug, and optimize code for latency‑critical workloads.
- Excellent problem‑solving skills and ability to work effectively in cross‑functional, collaborative teams.