onsite
System Administrator II - High Performance Computing Infrastructure
Systems Engineer
About the role
We are seeking an experienced System Administrator to manage, automate, and optimize Linux‑based HPC clusters, ensuring high availability, performance, and security for mission‑critical scientific and analytics workloads.
Key Responsibilities
- Deploy, configure, and maintain Linux HPC nodes, storage, and interconnect fabrics across multiple sites.
- Automate provisioning, patching, and configuration management using Ansible and custom Python scripts.
- Monitor cluster health, performance, and capacity; troubleshoot Slurm job scheduler issues and optimize workload throughput.
- Implement and enforce security hardening, access controls, and compliance standards for all infrastructure components.
- Collaborate with developers and scientists to support application scaling, data movement, and reproducible research pipelines.
Requirements
- 3+ years of hands‑on Linux system administration in an HPC or large‑scale compute environment.
- Proficiency with job schedulers (e.g., Slurm), configuration management tools (e.g., Ansible), and scripting in Python or Bash.
- Strong understanding of networking, storage (Lustre/NFS), and virtualization/container technologies.
- Experience implementing security best practices, monitoring solutions, and performance tuning for compute clusters.
- Ability to work independently, document procedures, and communicate technical concepts to cross‑functional teams.