Linux HPC System Administrator (Remote) - RTX Corporation
Systems Engineer
Remote Linux HPC System Administrator responsible for deploying, managing, and optimizing high‑performance compute clusters, automating workflows with Python and Ansible, and ensuring reliable networking and storage for scientific workloads.
About the role
Key Responsibilities
- Design, install, and maintain Linux‑based HPC clusters using the Slurm workload manager.
- Develop and maintain automation scripts and playbooks (Python, Ansible) for provisioning, configuration, and routine maintenance.
- Monitor system performance, troubleshoot hardware/software issues, and optimize compute, network, and storage resources.
- Implement security best practices, user access controls, and compliance with program requirements.
- Collaborate with researchers and engineers to support application scaling and workflow integration.
Requirements
- 3+ years of experience administering Linux servers in an HPC environment.
- Proficiency with Slurm or similar job schedulers and experience managing high‑speed interconnects.
- Strong scripting skills in Python and automation experience with Ansible or comparable tools.
- Knowledge of networked storage solutions (NFS, Lustre, GPFS) and storage performance tuning.
- U.S. citizenship and ability to meet program security requirements.