onsite
Server Validator & AI Debug Analyst - AMD
Software Engineer
Lead server validation and AI debugging to ensure that high‑performance hardware meets the demands of AI workloads. Use Python, C++, and advanced debugging tools to analyze, troubleshoot, and optimize server behavior for next‑generation AI and data‑center applications.
About the role
Key Responsibilities
- Design and execute comprehensive server validation plans for AI workloads, ensuring reliability and performance targets are met.
- Diagnose and resolve complex hardware and software issues using Python, C++, and industry‑standard debugging tools.
- Collaborate with cross‑functional teams to analyze AI model behavior, identify bottlenecks, and recommend hardware optimizations.
- Develop automated test scripts and validation frameworks to accelerate defect detection and regression testing.
- Document findings, create detailed reports, and present actionable insights to engineering and product teams.
Requirements
- Strong experience in server validation, hardware testing, and debugging AI workloads.
- Proficiency in Python and C++ for test development and data analysis.
- Deep understanding of debugging tools (e.g., GDB, SystemTap, Intel VTune) and performance profiling.
- Excellent problem‑solving skills and the ability to work independently in a fast‑paced environment.
- Effective communication skills for cross‑team collaboration and technical documentation.