remote

Senior Staff Machine Learning Engineer, Generative AI (Evaluation & Data Flywheel)

Airbnb is seeking a Senior Staff Machine Learning Engineer to lead the technical direction and execution of ML evaluation and the end-to-end data flywheel for CSxAI products. This role involves defining evaluation strategies, building scalable frameworks, and driving continuous improvement for Generative AI systems in a customer support context.

About the role

Role Overview

Airbnb was founded in 2007 and has since grown to over 5 million hosts who have welcomed over 2 billion guest arrivals globally. AI and ML are central to Airbnb's product and are used across functions from Trust and Payments to Customer Service and Marketing to ensure optimal experiences for guests and hosts.

The Core ML team drives CSxAI (Customer Support x Artificial Intelligence) initiatives by integrating Generative AI technologies, with the aim of creating an intelligent, scalable, and exceptional service experience. The team develops and enhances AI models, ML services, and tools, including LLM fine-tuning and optimization, RAG/Search, LLM evaluation and testing automation, feedback-based learning, and guardrails for diverse applications at Airbnb. The complexity of Airbnb's data and marketplace demands state-of-the-art AI practices, with a commitment to long-term innovation.

The Difference You Will Make

In this Senior Staff role, you will define the technical direction and lead the execution for ML evaluation and the end-to-end data flywheel that powers CSxAI products, such as assistive agents, issue resolution, and tooling. Your contributions will be critical in defining quality measurement, converting feedback into learning signals, and continuously improving models and products safely and efficiently. You will collaborate closely with product, engineering, and design teams to build evaluation systems that are trusted, scalable, and actionable, linking offline metrics to online outcomes.

A Typical Day

Define evaluation strategy and success metrics for GenAI systems, ensuring alignment between offline evaluation and online business/customer experience outcomes.
Build and scale evaluation frameworks (e.g., golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge) with robust controls for bias, drift, and reliability.
Design the data flywheel, encompassing instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement.
Lead cross-functional quality initiatives across product, operations, and engineering, establishing clarity on quality standards and how teams should act on evaluation results.
Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (pre-deploy and post-deploy).
Drive technical decisions and architecture for evaluation and data infrastructure, balancing speed, rigor, cost, and safety.

Minimum Qualifications

Educational Background: PhD in Computer Science, Mathematics, Statistics, or a related technical field (or equivalent practical experience).
Industry Experience: 10+ years of experience building, testing, and shipping ML/AI systems end-to-end, including 2+ years of experience with GenAI/LLM systems in production.
Leadership Experience: 5+ years leading large, ambiguous technical initiatives as a senior Individual Contributor (IC), influencing roadmap and engineering/science direction across teams.
Technical Proficiency:
- Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing).
- Hands-on experience with GenAI systems, including orchestration, retrieval, tool calling, and memory.
- Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, and governance).
- Solid ML fundamentals and best practices (model selection, training/serving, monitoring, reliability, and model lifecycle management).

Preferred Qualifications

Customer Support Systems: Experience applying ML/AI to customer support workflows (e.g., agent assist, classification/routing, resolution recommendation, QA).
Infrastructure & Quality at Scale: Experience building robust evaluation platforms for agent behavior validation, safety/guardrails, and continuous improvement.
Agile Practice for Applied AI: Proven ability to take evaluation and data flywheel work from incubation to production, iterating quickly while maintaining scientific rigor.
