About the Team
The KRAFTON MLSys & Ops Team designs, builds, and operates the GPU infrastructure and ML platforms used for model development and service deployment within the AI Service Division. The team is responsible for model training and experimentation environments, ML pipelines, model serving infrastructure, GPU cluster operations, infrastructure automation, and observability systems, providing a common platform that runs the AI workloads required for game production and services stably and efficiently.
This position is for a hands-on senior engineer who will operate and enhance the existing B300 125-node GPU infrastructure, evolving it into a stable and efficient GPU platform for research, development, and service organizations. Drawing on practical experience, the engineer will also propose and help implement GPU/compute operation strategies, including future GPU purchases, cloud integration, in-house builds, and use of external infrastructure.
Mission
- Operate and enhance the B300 125-node GPU infrastructure, developing it into a stable and efficient GPU platform for research, development, and service organizations.
- Work hands-on to stabilize operations, improve performance and resource efficiency, and automate operations of the B300 125-node GPU infrastructure.
- Design, build, and operate the scheduling, multi-tenancy, workload isolation, quotas, observability, and disaster recovery systems for Kubernetes-based ML/GPU platforms.
- Develop and implement operational strategies to improve GPU utilization, latency, throughput, and cost efficiency based on the characteristics of training/inference workloads.
- Provide technical recommendations on future GPU purchases, cloud integration, in-house builds, or use of external infrastructure, based on GPU capacity planning and usage analysis.
- Enhance the ML platform and establish reproducible operational practices, reconciling requirements from multiple teams from a common-platform perspective.
Requirements
- Experience in designing, building, and operating large-scale GPU clusters or Kubernetes-based ML platforms running AI/ML training or inference workloads.
- Experience in directly improving scheduling, multi-tenancy, workload isolation, quotas, observability, and disaster recovery systems for Kubernetes-based ML/GPU platforms.
- Experience analyzing GPU utilization, resource allocation, scheduling, and prioritization, and applying cost/performance optimization strategies in production operations.
- Understanding of the overall ML workflow including model training, experimentation, deployment, and serving, with experience in operating and enhancing common ML platforms.
- Experience establishing and maintaining reproducible, repeatable platform operating standards using IaC, GitOps, CI/CD, and observability systems.
- Experience analyzing failures and performance issues from a system-wide perspective and driving them through to root-cause resolution and structural improvements.
- Experience collaborating with research, development, and service organizations to consolidate common ML/GPU platform requirements and propose technical alternatives and execution priorities.
- Practical experience using AI tools such as generative AI, LLM-based tools, and code assistants to improve operational efficiency, problem-solving, documentation, and automation productivity.
- Must be eligible for overseas business travel, with no disqualifying restrictions.
Preferred Qualifications
- Experience with GPU resource management, scheduling, and orchestration tools such as NVIDIA GPU Operator, DCGM, MIG/MPS, Run:ai, Slurm, Kueue, Volcano.
- Experience operating, validating, and optimizing recent GPU architectures such as B-Series/B300, H100/H200, GB200/GB300, or equivalent.
- Experience building or enhancing model serving and ML pipeline platforms such as KServe, Triton, Ray Serve, Kubeflow, Argo Workflows.
- Experience with system-, network-, and storage-level performance optimization, covering Linux, cgroups, NUMA, I/O, container runtimes, NCCL/RDMA, InfiniBand/RoCE, and Ceph/MinIO.
- Experience designing, implementing, and operating resource policies based on GPU/Cloud costs, utilization, latency, throughput, and team-specific usage.