About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability must be co-designed across model architecture, communication systems, runtime, and hardware topology. This role sits at the core of that effort: it drives communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads. This is not a network operations role; it is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design. In this role, you will:
- Design and optimize expert-parallel and hybrid-parallel communication patterns
- Drive high-performance hierarchical collectives for MoE workloads
- Co-design runtime orchestration with communication topology awareness
- Reduce tail latency and improve determinism across thousands of GPUs
- Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
- Communication-compute overlap and topology-aware collective optimization
- Deep debugging of NCCL, RDMA, and custom communication layers
- Hybrid expert-parallel strategies in modern large-scale MoE systems
- Elastic and resilient distributed job orchestration concepts
- Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
- Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
- Hybrid expert-parallel communication for Mixture-of-Experts training
- Scaling behavior under network pressure
- Distributed orchestration for elastic, large-scale training
- Fault detection and recovery in distributed GPU workloads
- Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
- Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
- Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
- Deep familiarity with NCCL and/or UCX internals
- Strong systems programming ability (C/C++, Rust, or Go)
- Strong familiarity with modern model training frameworks such as PyTorch
- Ability to troubleshoot and profile training performance issues related to communication bottlenecks
- Ability to translate research ideas into production-grade optimizations
- Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
- You can explain why communication performance degrades at scale and how to fix it
- You have improved real cluster throughput via communication redesign
- You can trace a distributed hang across ranks and identify the root cause
- You are comfortable working at the boundary between hardware and runtime
Application Requirements
- Include a link to your GitHub (required)
- Provide links to relevant distributed systems, HPC, or large-scale training projects
- Include a list of publications and/or public technical reports (if applicable)
- Describe the hardest distributed debugging problem you have solved
- Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.
Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include
- Comprehensive medical, dental, and vision benefits
- Bonus
- 401(k) plan
- Generous paid time off, sick leave, and holidays
- Paid parental leave
- Employee Assistance Program
- Life and disability insurance