onsite

Senior MLOps Engineer

KRAFTON is seeking a Senior MLOps Engineer to operate and enhance its B300 125-node GPU infrastructure and turn it into a robust GPU platform. The role involves stabilizing operations, improving performance, and developing strategies for GPU utilization and future infrastructure decisions to support AI workloads in game development and services.

About the role

About the Team

The KRAFTON MLSys & Ops Team designs, builds, and operates GPU infrastructure and ML platforms for model development and service deployment within the AI Service Division. The team is responsible for model training and experimentation environments, ML pipelines, model serving infrastructure, GPU cluster operations, infrastructure automation, and observability systems, and provides a common platform that runs the AI workloads required for game production and services stably and efficiently.

This position is for a hands-on senior engineer who will operate and enhance the existing B300 125-node GPU infrastructure, evolving it into a stable and efficient GPU platform for research, development, and service organizations. Drawing on practical experience, the engineer will also make technical proposals for GPU/compute operation strategy, including future GPU purchases, cloud integration, in-house builds, and use of external infrastructure, and take part in implementing them.

Mission

Operate and enhance the B300 125-node GPU infrastructure, developing it into a stable and efficient GPU platform for research, development, and service organizations.
Directly participate in the operational stabilization, performance improvement, resource efficiency, and operational automation of the B300 125-node GPU infrastructure.
Design, build, and operate the scheduling, multi-tenancy, workload isolation, quotas, observability, and disaster recovery systems for Kubernetes-based ML/GPU platforms.
Develop and implement operational strategies to improve GPU utilization, latency, throughput, and cost efficiency based on the characteristics of training/inference workloads.
Provide technical recommendations on future GPU purchases, cloud integration, in-house builds, or use of external infrastructure, based on GPU capacity planning and usage analysis.
Enhance ML platforms and reproducible operational systems, coordinating requirements from various teams from a common platform perspective.

Requirements

Experience in designing, building, and operating large-scale GPU clusters or Kubernetes-based ML platforms running AI/ML training or inference workloads.
Experience in directly improving scheduling, multi-tenancy, workload isolation, quotas, observability, and disaster recovery systems for Kubernetes-based ML/GPU platforms.
Experience in analyzing GPU utilization, resource allocation, scheduling, and prioritization, and in applying cost/performance optimization strategies to production operations.
Understanding of the overall ML workflow including model training, experimentation, deployment, and serving, with experience in operating and enhancing common ML platforms.
Experience in creating and operating reproducible and repeatable platform operating standards using IaC, GitOps, CI/CD, and observability systems.
Experience in analyzing failures and performance issues from a system-wide perspective, leading to root cause resolution and structural improvements.
Experience collaborating with research, development, and service organizations to organize common ML/GPU platform requirements and propose technical alternatives and execution priorities.
Experience leveraging AI tools such as generative AI, LLM-based tools, and code assistants in practice to enhance operational efficiency, problem-solving, documentation, and automation productivity.
No restrictions that would prevent overseas business travel.

Preferred Qualifications

Experience with GPU resource management, scheduling, and orchestration tools such as NVIDIA GPU Operator, DCGM, MIG/MPS, Run:ai, Slurm, Kueue, Volcano.
Experience operating, validating, and optimizing next-generation GPU architectures such as B-Series/B300, H100/H200, GB200/GB300 or equivalent.
Experience building or enhancing model serving and ML pipeline platforms such as KServe, Triton, Ray Serve, Kubeflow, Argo Workflows.
Experience with system-, network-, and storage-level performance optimization, covering Linux systems, cgroups, NUMA, I/O, container runtimes, NCCL/RDMA, InfiniBand/RoCE, and Ceph/MinIO.
Experience designing, implementing, and operating resource policies based on GPU/Cloud costs, utilization, latency, throughput, and team-specific usage.
