onsite
Linux System Administrator II - HPC Infrastructure and Sustainment
Systems Engineer
Senior Linux System Administrator responsible for designing, deploying, and sustaining multi‑vendor HPC clusters, automating operations with Ansible and scripting, and ensuring high‑availability storage and network services for mission‑critical workloads.
About the role
Key Responsibilities
- Deploy, configure, and maintain Linux‑based HPC servers and clusters across multiple vendors.
- Automate provisioning, patching, and configuration management using Ansible and custom Python/Bash scripts.
- Monitor system health, performance, and capacity; troubleshoot hardware, network, and storage issues to meet SLA targets.
- Implement and manage high‑availability storage solutions (e.g., Lustre, GPFS) and high‑speed interconnects.
- Collaborate with security and application teams to harden systems and support scientific and intelligence workloads.
Requirements
- 5+ years of Linux systems administration experience, preferably in an HPC environment.
- Strong knowledge of HPC job schedulers (e.g., Slurm, PBS) and parallel file systems.
- Proficiency with Ansible automation and scripting in Python or Bash.
- Experience with virtualization, containerization, and network configuration for high‑throughput computing.
- Demonstrated ability to troubleshoot complex hardware, storage, and networking problems in mission‑critical settings.
Skills
Linux, Ansible, Python, Bash