remote

Research Scientist - LLM Training System as a Service PhD

Research Engineer

PhD‑level Research Scientist to design and optimize a cloud‑native LLM training platform, focusing on CUDA‑based distributed GPU performance, scaling, and cost‑effective model training for next‑generation AI services.

About the role

Key Responsibilities

Design and implement a scalable, service‑oriented training system for large language models (LLMs) that runs on multi‑GPU clusters.
Develop and benchmark CUDA kernels and communication primitives to maximize GPU utilization and minimize training latency.
Research novel distributed training algorithms, including pipeline, tensor, and data parallelism, and integrate them into the platform.
Collaborate with software engineers to expose training capabilities via APIs and orchestration tools for internal and external users.
Publish research findings in top conferences/journals and contribute to open‑source deep‑learning frameworks.

Requirements

PhD in Computer Science, Electrical Engineering, or a related field with a focus on high‑performance computing, deep learning, or systems research.
Strong expertise in CUDA programming, GPU architecture, and performance profiling/optimization.
Hands‑on experience with distributed training of LLMs using frameworks such as PyTorch, DeepSpeed, or Megatron‑LM.
Proven track record of research publications and/or contributions to open‑source projects.
Excellent problem‑solving skills and ability to work autonomously in a fast‑moving research environment.

Skills

CUDA, Python

Department: Research

Location: San Jose, California, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 26, 2026