hybrid

Training Infrastructure Engineer

The Training Infrastructure Engineer will be responsible for optimizing the full training stack, focusing on GPU behavior profiling, debugging training pipelines, and improving throughput for efficient model training at scale. This role involves working across cluster management, model training, and efficient data pipelines to build the foundation for generative models.

About the Role

In this role, you’ll focus on the full training stack: profiling GPU behavior, debugging training pipelines, improving throughput, choosing the right parallelism strategies, and designing the infrastructure that lets us train models efficiently at scale. You’ll work across cluster management, model training, efficient data pipelines for video and audio, inference, and PyTorch code optimization. Your work will shape the foundation on which all of our generative models are built and iterated.

Key Responsibilities

Find ideal training strategies (parallelism approaches, precision trade-offs) for a variety of model sizes and compute loads
Profile, debug, and optimize single and multi-GPU operations using tools like Nsight and stack trace viewers to understand what's actually happening at the hardware level
Analyze and improve the whole training pipeline end to end (efficient data storage, data loading, distributed training, checkpoint/artifact saving, logging, …)
Set up scalable systems for experiment tracking, data/model versioning, and experiment insights
Design, deploy and maintain large-scale ML training clusters running SLURM for distributed workload orchestration

Ideal Candidate Profile

Familiarity with the latest and most effective techniques in optimizing training and inference workloads—not from reading papers, but from implementing them
Deep understanding of GPU memory hierarchy and computation capabilities—knowing what the hardware can do theoretically and what prevents us from achieving it
Experience optimizing for both memory-bound and compute-bound operations and understanding when each constraint matters
Expertise with efficient attention algorithms and their performance characteristics at different scales

Nice to Have

Experience implementing custom GPU kernels and integrating them into PyTorch
Experience with diffusion and autoregressive models and understanding of their specific optimization challenges
Familiarity with high-performance storage solutions (VAST, blob storage) and understanding of their performance characteristics for ML workloads
Experience with managing SLURM clusters at scale

Skills