onsite
Senior System Administrator - High Performance Computing
Systems Engineer
Senior System Administrator responsible for designing, securing, and maintaining multi‑vendor HPC clusters, storage, and networking, using Linux, Slurm, Ansible, and Python to support mission‑critical scientific workloads.
About the role
Key Responsibilities
- Design, deploy, and operate high‑performance computing clusters across heterogeneous vendor platforms.
- Implement and maintain job scheduling and resource management using Slurm, ensuring optimal utilization and reliability.
- Automate system provisioning, configuration, and patch management with Ansible and custom Python scripts.
- Manage high‑throughput storage solutions (e.g., Lustre, GPFS) and ensure data integrity and performance.
- Monitor system health, troubleshoot hardware/software issues, and coordinate with vendors for rapid resolution.
- Enforce security hardening, access controls, and compliance standards for all HPC infrastructure.
Requirements
- 5+ years of experience administering Linux‑based HPC environments.
- Strong knowledge of cluster scheduling (Slurm) and parallel file systems (Lustre, GPFS).
- Proficiency in automation tools such as Ansible and scripting in Python or Bash.
- Experience with virtualization/container technologies (KVM, Docker) and high‑speed networking (InfiniBand, Ethernet).
- Demonstrated ability to troubleshoot complex hardware and software issues in mission‑critical settings.