onsite

System Administrator II - High Performance Computing Infrastructure

Systems Engineer

We are seeking an experienced System Administrator to manage, automate, and optimize Linux‑based HPC clusters, ensuring high availability, performance, and security for mission‑critical scientific and analytics workloads.

About the role

Key Responsibilities

Deploy, configure, and maintain Linux HPC nodes, storage, and interconnect fabrics across multiple sites.
Automate provisioning, patching, and configuration management using Ansible and custom Python scripts.
Monitor cluster health, performance, and capacity; troubleshoot job scheduler (Slurm) issues and optimize workload throughput.
Implement and enforce security hardening, access controls, and compliance standards for all infrastructure components.
Collaborate with developers and scientists to support application scaling, data movement, and reproducible research pipelines.

Requirements

3+ years of hands‑on Linux system administration in an HPC or large‑scale compute environment.
Proficiency with job schedulers (e.g., Slurm), configuration management tools (Ansible), and scripting in Python or Bash.
Strong understanding of networking, storage (Lustre/NFS), and virtualization/container technologies.
Experience implementing security best practices, monitoring solutions, and performance tuning for compute clusters.
Ability to work independently, document procedures, and communicate technical concepts to cross‑functional teams.

Skills

Linux, Ansible, Python

Department: Engineering

Location: Lehi, Utah, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 25, 2026