Remote / On‑site
Hybrid Hardware & Software Support Engineer - HPC - Atos
Software Engineer
Provide expert on‑site and remote support for high‑performance computing (HPC) systems: integrate hardware and software components, diagnose issues, and keep HPC clusters performing optimally, working with Linux, C++, and Python.
About the role
Key Responsibilities
- Deliver first‑line and advanced support for HPC clusters, including servers, interconnects, storage, and accelerators.
- Diagnose and resolve hardware failures, firmware issues, and software bugs across Linux environments.
- Collaborate with R&D and system architects to implement performance optimizations and configuration changes.
- Develop and maintain automation scripts and monitoring tools using Python and Bash to streamline incident handling.
- Provide technical guidance and training to customers and internal teams on best practices for HPC operation.
Requirements
- Strong experience with Linux system administration and networking in HPC contexts.
- Proficiency in C++ and Python for debugging, scripting, and tool development.
- Hands‑on knowledge of hardware components such as CPUs, GPUs, high‑speed interconnects (InfiniBand), and storage systems.
- Familiarity with parallel programming models (MPI, OpenMP) and performance analysis tools.
- Excellent problem‑solving skills and the ability to work under pressure in a hybrid (remote/on‑site) support model.