onsite

Senior Distributed Systems Engineer

Distributed Systems Engineer

The Senior Distributed Systems Engineer will optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This role focuses on performance engineering, distributed debugging, and communication-runtime co-design to improve scalability and fault tolerance across ultra-scale GPU supercomputing systems.

About the role

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology. This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

Design and optimize expert-parallel and hybrid-parallel communication patterns
Drive high-performance hierarchical collectives for MoE workloads
Co-design runtime orchestration with communication topology awareness
Reduce tail latency and improve determinism across thousands of GPUs
Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

Communication-compute overlap and topology-aware collective optimization
Deep debugging of NCCL, RDMA, and custom communication layers
Hybrid expert parallel strategies in modern large-scale MoE systems
Elastic and resilient distributed job orchestration concepts
Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

Hybrid expert parallel communication for Mixture-of-Experts training
Scaling behavior under network pressure
Distributed orchestration for elastic, large-scale training
Fault detection and recovery in distributed GPU workloads
Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
Deep familiarity with NCCL and/or UCX internals
Strong systems programming ability (C/C++, Rust, or Go)
Strong familiarity with modern model training frameworks such as PyTorch
Ability to troubleshoot and profile training performance issues related to communication bottlenecks
Ability to translate research ideas into production-grade optimizations
Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

You can explain why a communication pattern degrades at scale and how to fix it
You have improved real cluster throughput via communication redesign
You can trace a distributed hang across ranks and identify the root cause
You are comfortable working at the boundary between hardware and runtime

Application Requirements

Include a link to your GitHub (required)
Provide links to relevant distributed systems, HPC, or large-scale training projects
Include a list of publications and/or public technical reports (if applicable)
Describe the hardest distributed debugging problem you solved
Include measurable performance improvements you have delivered

Academic Qualifications

Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

Comprehensive medical, dental, and vision benefits
Bonus
401K Plan
Generous paid time off, sick leave, and holidays
Paid Parental Leave
Employee Assistance Program
Life and disability insurance
