onsite

Software Engineer - Model Evaluation & Benchmarking

Lead the design and implementation of automated evaluation pipelines for multimodal AI models, using Python and benchmarking techniques to ensure reliability and quality across image, video, and generative systems.

About the role

Key Responsibilities

Design and maintain end‑to‑end evaluation pipelines for multimodal generative and vision‑based models.
Develop automated benchmarking suites that assess realism, consistency, and quality across image, video, and multimodal outputs.
Collaborate with applied science, infrastructure, and product teams to define evaluation metrics and data requirements.
Implement dataset‑driven testing frameworks and performance validation pipelines, integrating with CI/CD workflows.
Analyze evaluation results, provide actionable insights, and drive continuous improvement of model quality.

Requirements

Strong programming skills in Python with experience building scalable data pipelines.
Hands‑on experience with machine learning model evaluation, benchmarking, and metrics definition.
Familiarity with multimodal AI systems (image, video, text) and generative models.
Proficiency in version control, containerization (Docker), and CI/CD practices.
Excellent analytical and communication skills, able to translate technical findings into product‑ready insights.

Skills

Python, machine learning

Department: Engineering

Location: San Francisco, California, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026