onsite

Senior ML Infrastructure Engineer

Prior Labs is seeking a Senior ML Infrastructure Engineer to own and evolve their multi-cluster GPU infrastructure, which currently spans Slurm on GCP. This role involves architecting the next generation of their infrastructure, driving GPU utilization, and building developer productivity tools, all while managing a significant compute budget.

About the role

We spend tens of millions per year on GPU compute to train tabular foundation models. That's not a target, it's what we're running today, and it's growing. The person who owns this infrastructure makes decisions worth millions of dollars: cluster architecture, scheduling efficiency, provider strategy, hardware selection. A wrong call costs six figures.

Today we run Slurm on GCP across multiple clusters. We're scaling to multi-cluster, multi-provider infrastructure and evaluating new hardware generations as they come online. You own the full stack, from cluster operations and cost optimization to distributed training performance and the tooling layer that keeps researchers moving fast. You work directly with the research team and understand what they're doing well enough to make infrastructure decisions that actually help them. And this isn't a pure support role. We operate an open environment. If you've got the next SOTA tabular architecture up your sleeve, go ahead and train it.

What you'll work on:

Own and evolve multi-cluster GPU infrastructure. Slurm on GCP today, multi-provider and new hardware tomorrow. Architecture, scheduling, reliability, cost optimization
Drive GPU utilization and training throughput: profiling, memory optimization, communication bottlenecks, systems-level debugging of distributed training across large runs
Architect the next generation of our infrastructure: multi-cluster orchestration, new GPU generations, provider diversification, capacity planning against growing compute demands
Build the developer productivity layer: CI pipelines, experiment tracking, model registry, data processing, and internal tooling that keeps research iteration speed high
Own the compute budget. You understand cost per FLOP across providers and hardware, and you hate wasted compute

Tech stack:

Slurm, GCP, Docker, wandb, GitHub Actions, uv, PyTorch, Triton

You may be a good fit if you have:

5+ years building and operating production GPU infrastructure or distributed training systems at scale, whether at a major AI lab, at a well-funded ML startup, or in an HPC environment
Deep hands-on experience with Slurm and cluster management. You've debugged scheduling failures, optimized utilization across multi-tenant GPU workloads, and operated infrastructure where downtime has real cost
Expert-level systems thinking: memory bandwidth, GPU profiling. You reason about hardware, not configs
Strong Python and genuine fluency with PyTorch internals: enough to profile a training run and tell whether the bottleneck is data loading, communication, or compute
Track record of making infrastructure decisions that measurably improved training throughput or cost efficiency
Strong AI tooling skills. You use Claude Code, Cursor, or similar fluently to move fast without sacrificing quality

Bonus:

Experience operating at tens-of-millions-scale GPU spend
Multi-cloud or hybrid HPC/cloud infrastructure experience
Triton, CUDA, or custom kernel experience
Experience scaling from single cluster to multi-cluster orchestration
Background building experiment tracking, model registry, or ML pipeline tooling
