Onsite
Lead Systems Engineer, HPC - Princeton University
Systems Engineer
Lead the design, deployment, and maintenance of HPC and AI infrastructure, collaborating with researchers and vendors to deliver scalable, high‑performance computing solutions on Linux platforms.
About the role
Key Responsibilities
- Design, install, and manage HPC clusters, ensuring optimal performance and reliability for research workloads.
- Collaborate with faculty, researchers, and vendors to specify hardware and software requirements for AI and HPC projects.
- Configure and maintain cluster software stacks, including MPI, Slurm, and GPU drivers, and implement security and compliance policies.
- Monitor system performance, troubleshoot issues, and implement capacity planning and scaling strategies.
- Provide technical guidance and training to research staff on HPC best practices and emerging AI technologies.
Requirements
- Extensive experience with Linux-based HPC environments and cluster management tools.
- Proficiency in GPU computing, MPI, and job scheduling systems such as Slurm.
- Strong understanding of networking, storage, and virtualization technologies in a research context.
- Excellent communication skills and the ability to work collaboratively with interdisciplinary teams.
- Experience with AI frameworks (e.g., TensorFlow, PyTorch) is a plus.