onsite
AI Infrastructure Engineer - 42dot
DevOps Engineer
Lead the design, deployment, and optimization of a multi‑data‑center GPU cluster, using Kubernetes and Slurm to ensure high availability, scalability, and efficient resource utilization for world‑class AI workloads.
About the role
Key Responsibilities
- Operate and maintain thousands of GPUs across multiple data centers using Kubernetes and Slurm for cluster orchestration.
- Monitor the GPU hardware and software stack, diagnose incidents, and execute rapid recovery to sustain high availability.
- Develop automation tools and scripts in Python or shell to streamline repetitive infrastructure tasks.
- Collaborate with data scientists and ML engineers to optimize resource allocation and performance for AI workloads.
- Implement monitoring dashboards, alerting, and capacity planning to support scaling and operational excellence.
Requirements
- Strong experience with Kubernetes, Slurm, and GPU cluster management.
- Proficiency in Python and shell scripting for automation.
- Hands‑on knowledge of GPU hardware, drivers, and performance tuning.
- Solid understanding of monitoring, logging, and incident response in distributed systems.
- Excellent problem‑solving skills and the ability to work in a fast‑paced, collaborative environment.