onsite
LLM Engineer, LLM Evaluation - 42dot
Lead the design and automation of large‑language‑model evaluation pipelines: build benchmark datasets, evaluation protocols, and end‑to‑end workflows on Kubernetes with MLflow and Argo Workflows to continuously improve model quality and reliability.
About the role
Key Responsibilities
- Design and maintain LLM evaluation frameworks, including benchmark datasets and evaluation metrics (human and LLM‑based).
- Develop and automate end‑to‑end evaluation pipelines on Kubernetes, using Argo Workflows for orchestration and MLflow for experiment tracking and deployment validation.
- Establish fair comparison protocols to benchmark multiple LLMs, ensuring reproducibility and consistency across experiments.
- Collaborate with research and engineering teams to iterate on model improvements based on evaluation insights.
- Monitor and enhance the reliability of the evaluation platform, scaling resources and optimizing performance.
Requirements
- Strong experience with Python and data‑engineering tools for large‑scale model evaluation.
- Proficiency in Kubernetes, Argo Workflows, and MLflow for orchestrating and tracking experiments.
- Hands‑on experience designing benchmark datasets and evaluation metrics for LLMs.
- Excellent problem‑solving skills and the ability to work in a fast‑moving, research‑driven environment.
- Effective communication skills to present findings to cross‑functional teams.
Skills
Python, Kubernetes, MLflow