Onsite

LLM Evaluation Engineer / Generative AI Quality Engineer - W2 Only

Generative AI Engineer

We are seeking a Senior AI Engineer to design and implement robust evaluation frameworks for generative AI systems, with a focus on prompt robustness, model reliability, and agentic AI behavior testing.

Key Responsibilities

Design and develop automated evaluation pipelines for LLM and agentic AI systems
Create evaluation scenarios and adversarial test datasets to identify model edge cases and bias
Assess AI outputs using metrics such as task success rate, semantic similarity, and sentiment analysis
Analyze and debug agent reasoning, tool usage, and action sequences to identify failure points
Develop advanced prompt engineering strategies to test reasoning, planning, and instruction adherence
Build and maintain custom evaluation frameworks using Python

Requirements

6+ years of hands-on experience in AI model evaluation and debugging
Strong proficiency in Python, automation scripts, and evaluation frameworks
Deep understanding of large language model behaviors, failure modes, and quality validation techniques
Experience with AI failure mode analysis (hallucination, incoherence, jailbreaking)
Familiarity with automated testing frameworks (e.g., pytest) and evaluation metrics

Skills

Python, prompt engineering, generative AI, AI quality validation, LLM evaluation frameworks, CI/CD integration

Department: Engineering

Location: Sunnyvale, California, United States

Experience: 7+ years

Tenure: Full-time

Level: Senior

Posted April 12, 2026