onsite
Senior HPC Systems Administrator - University of Oxford
Systems Engineer
Lead the design, deployment, and optimisation of HPC infrastructure for AI and computer vision research, managing Linux clusters, GPU resources, and cloud integration.
About the role
Key Responsibilities
- Design, deploy, and maintain Linux‑based HPC clusters, ensuring high availability and performance for AI workloads.
- Configure and optimise GPU resources, including the NVIDIA CUDA software stack, for deep learning and computer vision pipelines.
- Implement and manage workload scheduling with Slurm, tailoring policies for research groups.
- Integrate on‑premises infrastructure with AWS services (ECS, S3, EC2) to support hybrid cloud workflows.
- Develop automation scripts (Python, Bash) for routine administration and monitoring.
- Collaborate with researchers to understand computational needs and provide technical guidance.
Requirements
- 5+ years of experience administering HPC or large‑scale Linux systems.
- Strong knowledge of GPU computing, CUDA, and deep learning frameworks.
- Proficiency with Slurm, Ansible, and cloud platforms (AWS preferred).
- Excellent scripting skills in Python or Bash.
- Effective communication and teamwork in an academic research environment.