onsite

AI Infrastructure Engineer - 42dot

DevOps Engineer

Lead the design, deployment, and optimization of a multi‑data‑center GPU cluster, using Kubernetes and Slurm to ensure high availability, scalability, and efficient resource utilization for world‑class AI workloads.

About the role

Key Responsibilities

Operate and maintain thousands of GPUs across multiple data centers using Kubernetes and Slurm for cluster orchestration.
Monitor the GPU hardware and software stack, diagnose incidents, and execute rapid recovery to sustain high availability.
Develop automation tools and scripts in Python or Shell to streamline repetitive infrastructure tasks.
Collaborate with data scientists and ML engineers to optimize resource allocation and performance for AI workloads.
Implement monitoring dashboards, alerting, and capacity planning to support scaling and operational excellence.

Requirements

Strong experience with Kubernetes, Slurm, and GPU cluster management.
Proficiency in Python and Shell scripting for automation.
Hands‑on knowledge of GPU hardware, drivers, and performance tuning.
Solid understanding of monitoring, logging, and incident response in distributed systems.
Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.

Skills

Kubernetes, Python

Company: 42dot

Department: Engineering

Location: Pangyo, Republic of Korea

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 21, 2026