hybrid

Senior Staff ML Systems Engineer (TLM), ML Infrastructure

The Senior Staff ML Systems Engineer (TLM) will lead the development and efficient deployment of large-scale machine learning models using advanced AI infrastructure. This role involves improving model efficiency across platforms, driving model-system co-design, and applying expertise in model optimizations and advanced algorithms for efficient execution on various hardware compute platforms.

About the Role

The Waymo ML Infrastructure team accelerates Waymo’s mission by building the best ecosystem for sustainably innovating and shipping ML-powered intelligence. Research, Production, and Hardware teams are our primary stakeholders, and our work powers the development of state-of-the-art models in Perception and Trajectory planning, core to our autonomous driving software. We enable our partners by offering best-in-class solutions for the entire model development lifecycle, including understanding model business goals and platform hardware characteristics, and co-designing models for the hardware. These solutions are developed in close collaboration with different modeling teams, with scale and efficiency as core tenets.

We are looking for an experienced senior TLM to join our team. In this critical role, you will lead the development and efficient deployment of large-scale machine learning models on state-of-the-art AI infrastructure. You will work cross-functionally at the intersection of data engineering, model development, and low-latency deployment in the datacenter and on-device, ensuring seamless integration across teams and technologies to power efficient innovation.

Responsibilities

Technical Leadership: Proactively study state-of-the-art (SOTA) model architectures and optimizations from the community and Google, including world models, diffusion, and flow matching techniques, and translate them into measurable technical deliverables in Waymo’s onboard driving stack.
Performance Analysis: Drive innovation in developer tooling for model performance inspection in highly distributed training and inference setups, apply roofline analysis to understand the efficiency headroom, and lead working groups to deliver the optimizations and meet system requirements.
Strong Execution: Develop high-performance optimizations and tools for a range of models and for large-scale training and inference, including on next-generation TPUs and low-bit-precision training and inference setups, and ensure all system components align toward high performance and goodput goals.
Cross-Team Leadership: Guide efforts across multiple teams and organizations to ensure seamless integration of data generation, model development, and deployment pipelines.
Mentorship & Management: Mentor junior engineers, helping them grow their technical expertise, and foster a culture of collaboration and engineering excellence. Manage IC performance for a medium-sized team of ~10 engineers.

Requirements

10+ years of professional software engineering experience, with at least 5 years in machine learning infrastructure such as developing, training, deploying, and optimizing large-scale machine learning systems.
Experience using ML accelerator profiling tools to uncover performance bottlenecks.
Solid experience in the development and optimization of machine learning infrastructure tools like DeepSpeed, PyTorch, TensorFlow, JAX, or similar frameworks.
Deep understanding of state-of-the-art machine learning models and architectures, such as autoregressive and diffusion transformers, and familiarity with custom kernels for efficient execution on diverse hardware compute platforms.
Strong leadership skills, with experience navigating cross-functional teams and providing technical leadership on projects across multiple organizations.
Excellent communication skills, both verbal and written, with the ability to translate complex technical concepts for a broad audience.
A Master’s or PhD in Computer Science, Engineering, or a related field is preferred.
