remote
Research Scientist (PhD) - LLM Training System as a Service
Research Engineer
We are seeking a PhD‑level Research Scientist to design and optimize a cloud‑native LLM training platform, with a focus on CUDA‑based distributed GPU performance, scaling, and cost‑effective model training for next‑generation AI services.
About the role
Key Responsibilities
- Design and implement a scalable, service‑oriented training system for large language models (LLMs) that runs on multi‑GPU clusters.
- Develop and benchmark CUDA kernels and communication primitives to maximize GPU utilization and minimize training latency.
- Research novel distributed training algorithms that build on pipeline, tensor, and data parallelism, and integrate them into the platform.
- Collaborate with software engineers to expose training capabilities via APIs and orchestration tools for internal and external users.
- Publish research findings in top conferences/journals and contribute to open‑source deep‑learning frameworks.
Requirements
- PhD in Computer Science, Electrical Engineering, or a related field with a focus on high‑performance computing, deep learning, or systems research.
- Strong expertise in CUDA programming, GPU architecture, and performance profiling/optimization.
- Hands‑on experience with distributed training of LLMs using frameworks such as PyTorch, DeepSpeed, or Megatron‑LM.
- Proven track record of research publications and/or contributions to open‑source projects.
- Excellent problem‑solving skills and ability to work autonomously in a fast‑moving research environment.