About KRAFTON AI Service Division
KRAFTON AI Service Division works with internal and external organizations to provide AI solutions for a range of problems, and develops proprietary services through in-house deep learning research. Our work falls into four broad areas:
- Production Cost Down: Applying deep learning technology to game production processes to increase efficiency and innovate the work experience for creators.
- New Way to Create: Expanding the methods of game creation with various deep learning technologies, including generative AI.
- Virtual Friends: Developing deep learning-based Virtual Friends to create new user experiences inside and outside games.
- Unique, Endless Gameplay: Implementing game content that provides users with new experiences every time through deep learning.
Team Introduction: KRAFTON MLSys & Ops Team
The KRAFTON MLSys & Ops Team designs, builds, and operates the GPU infrastructure and ML platforms used for model development and service deployment within the AI Service Division. We handle model training and experiment environments, ML pipelines, model serving infrastructure, GPU cluster operations, infrastructure automation, and observability systems. Our goal is to create a common platform on which the AI workloads needed for game production and services run stably and efficiently.
This position is for a hands-on Senior MLOps Engineer responsible for operating and enhancing the existing 125-node B300 GPU infrastructure and evolving it into a GPU platform that research, development, and service organizations can use stably and efficiently. Drawing on insights gained from hands-on operations, you will also make technical proposals for, and help execute, a GPU/Compute operating strategy that covers future GPU purchases, parallel cloud usage, in-house build-out, and use of external infrastructure.
Your Mission
- Participate directly in the stabilization, performance improvement, resource efficiency, and operational automation of the 125-node B300 GPU infrastructure.
- Design, build, and operate scheduling, multi-tenancy, workload isolation, quotas, observability, and fault response systems for Kubernetes-based ML/GPU platforms.
- Establish operational strategies to improve GPU utilization, latency, throughput, and cost efficiency based on training/inference workload characteristics, and implement them in the actual platform.
- Provide technical recommendations on future GPU purchases, parallel cloud usage, in-house build-out, and use of external infrastructure, based on GPU capacity planning and utilization analysis.
- Enhance the ML platform and its reproducible operating framework, and coordinate the requirements of various teams from a common platform perspective.
Required Experience
- Experience in designing, building, and operating large-scale GPU clusters or Kubernetes-based ML platforms where AI/ML training or inference workloads run.
- Experience in directly improving scheduling, multi-tenancy, workload isolation, quotas, observability, and fault response systems for Kubernetes-based ML/GPU platforms.
- Experience in applying cost/performance optimization strategies based on GPU utilization analysis, resource allocation, scheduling, and prioritization to actual operations.
- Understanding of the entire ML workflow (model training, experimentation, deployment, serving) and experience in operating and enhancing a common ML platform.
- Experience in creating and operating reproducible and repeatable platform operation standards using IaC, GitOps, CI/CD, and observability systems.
- Experience in analyzing failures and performance issues from a system-wide perspective and following through to root cause resolution and structural improvements.
- Experience collaborating with research, development, and service organizations to organize common ML/GPU platform requirements and propose technical alternatives and execution priorities.
- Experience using AI tools such as generative AI, LLM-based tools, and code assistants in practice to improve operational efficiency, problem-solving, documentation, and automation productivity.
- No restrictions that would prevent overseas business travel.
Preferred Experience
- Experience using GPU resource management, scheduling, and orchestration tools such as NVIDIA GPU Operator, DCGM, MIG/MPS, Run:ai, Slurm, Kueue, and Volcano.
- Experience in operating, verifying, and optimizing **B-Series/B300, H100/H200, GB200/GB300**, or equivalent next-generation GPU architectures.
- Experience in building or enhancing model serving and ML pipeline platforms such as KServe, Triton, Ray Serve, Kubeflow, and Argo Workflows.
- Experience in system-, network-, and storage-level performance optimization, covering Linux systems, cgroups, NUMA, I/O, container runtimes, NCCL/RDMA, InfiniBand/RoCE, and Ceph/MinIO.
- Experience in designing, implementing, and operating resource policies based on GPU/Cloud costs, utilization, latency, throughput, and team-specific usage.