Remote
HPC Systems Engineer - UC San Diego
Systems Engineer
Lead the design, deployment, and optimization of high‑performance computing clusters, maintaining robust Linux environments, efficient SLURM job scheduling, and well‑tuned performance across hybrid on‑prem and cloud infrastructure.
About the role
Key Responsibilities
- Design, install, and maintain HPC clusters, including compute nodes, storage, and networking components.
- Configure and optimize SLURM workloads, ensuring efficient job scheduling and resource allocation.
- Develop and maintain Python scripts for automation, monitoring, and performance analysis.
- Collaborate with researchers to troubleshoot performance bottlenecks and implement tuning strategies.
- Integrate hybrid cloud resources (AWS/GCP) to extend capacity and provide elastic compute options.
- Document system configurations, procedures, and best practices for internal use.
Requirements
- Strong experience with Linux system administration and HPC cluster environments.
- Proficiency in SLURM or equivalent workload managers.
- Hands‑on scripting skills in Python for automation and data analysis.
- Knowledge of HPC networking, high‑speed interconnects, and storage solutions.
- Experience with cloud platforms (AWS, GCP) and hybrid deployment models is a plus.
Skills
machine learning, python, bash, linux