Onsite
LLM Evaluation Engineer / Generative AI Quality Engineer - W2 Only
Generative AI Engineer
About the role
We are seeking a Senior AI Engineer to design and implement robust evaluation frameworks for generative AI systems, with a focus on prompt robustness, model reliability, and agentic AI behavior testing.
Key Responsibilities
- Design and develop automated evaluation pipelines for LLM and agentic AI systems
- Create evaluation scenarios and adversarial test datasets to identify model edge cases and bias
- Assess AI outputs using metrics such as task success rate, semantic similarity, and sentiment analysis
- Analyze and debug agent reasoning, tool usage, and action sequences to identify failure points
- Develop advanced prompt engineering strategies to test reasoning, planning, and instruction adherence
- Build and maintain custom evaluation frameworks using Python
Requirements
- 6+ years of hands-on experience in AI model evaluation and debugging
- Strong proficiency in Python, automation scripts, and evaluation frameworks
- Deep understanding of large language model behaviors, failure modes, and quality validation techniques
- Experience with AI failure mode analysis (hallucination, incoherence, jailbreaking)
- Familiarity with automated testing frameworks (e.g., pytest) and evaluation metrics
Skills
- Python
- Prompt engineering
- Generative AI
- AI quality validation
- LLM evaluation frameworks
- CI/CD integration