onsite

Staff Software Engineer, AI Runtime

As a Staff Software Engineer for AI Runtime at Databricks, you will drive the architecture and evolution of a managed GPU training platform, enabling scalable and resilient large-scale AI model training. This role involves solving complex problems in distributed training, optimizing GPU performance, and enhancing developer experience for cutting-edge AI teams.

About the Role

As a Staff Software Engineer for AI Runtime (AIR) at Databricks, you will be instrumental in building and scaling the systems that make large-scale GPU training fast, reliable, and effortless. AIR is a managed platform for large-scale GPU training and fine-tuning, offering on-demand access to accelerator fleets and a serverless experience for multi-node job orchestration. You will drive the architecture and evolution of the managed GPU training stack, which spans scheduling, capacity management, distributed training performance, fault tolerance, and the developer experience for launching and operating jobs at scale. Beyond direct contributions to core systems, you will help define the long-term technical vision for AIR, mentor senior engineers, collaborate with product, research, and platform teams, and lead initiatives that expand the technical and business impact of custom training at Databricks.

The Impact You Will Have

Drive the architecture and evolution of AIR's managed GPU training platform, delivering scalable, high-throughput, and resilient training across fleets that span thousands of accelerators.
Solve the hardest problems in large-scale training, including multi-node orchestration, distributed parallelism strategies, GPU scheduling and dynamic routing, high-throughput data loading, and checkpoint and restore for very long-running jobs.
Push GPU efficiency and training performance, raising utilization (such as model FLOPs utilization and end-to-end throughput) and lowering cost per training run across diverse model architectures and hardware generations.
Build the resilience and observability foundations that keep multi-node jobs healthy, detecting and recovering from hardware and software failures with minimal disruption to customers.
Partner with product, research, and platform teams to shape the APIs, CLI, and developer experience that make it easy to launch, monitor, and debug production training jobs.
Lead end-to-end engineering efforts, from design through production rollout, holding a high bar for performance, correctness, and reliability.
Make direct, high-impact contributions to the core systems behind AIR, and help bring up support for the latest accelerators and new regions as the fleet grows.
Champion engineering excellence, mentor other engineers through design reviews and technical discussions, and help shape Databricks' long-term technical direction in AI training infrastructure.

What We Look For

10+ years of experience building and operating large-scale distributed systems, with significant depth in GPU training infrastructure, high-performance computing, or ML systems.
Hands-on experience with distributed training frameworks (such as PyTorch, FSDP, DeepSpeed, or Megatron) and the parallelism strategies (data, tensor, pipeline, and sequence parallelism) used to train large models.
Strong understanding of training resilience patterns, including checkpointing, failure detection, and automatic recovery for long-running, multi-node jobs.
Solid grasp of GPU performance fundamentals, including accelerator architecture, high-speed interconnects (such as NVLink and InfiniBand or RoCE), collective communication, and the bottlenecks that govern training throughput and utilization.
Experience building and operating managed, multi-tenant platform products in the cloud, with clear SLAs and SLOs for availability, performance, and reliability.
Strong foundation in algorithms, data structures, and system design as applied to performance-sensitive, large-scale distributed systems.
Proven ability to deliver technically complex, high-impact initiatives that create clear customer or business value.
Strong communication skills and the ability to collaborate across product, research, and infrastructure teams in a fast-moving environment.
Strategic, product-oriented mindset with the ability to align technical execution to a long-term vision, and a passion for mentoring engineers and fostering technical excellence.
BS in Computer Science or a related field (MS or PhD preferred).
