About the Role
As an MLOps Engineer in the DAMO service line at Thoughtworks, you will be responsible for ensuring the reliability, safety, performance, and continuous improvement of large-scale machine learning and AI systems in production. This includes both generative AI and traditional ML systems such as computer vision and recommendation models. You will work across the full software delivery lifecycle, contributing to design, implementation, deployment, and ongoing operational excellence.
You will champion engineering best practices, including clean and maintainable code, test-driven development, continuous delivery, strong observability, and collaborative development through pairing and code reviews. You will stay hands-on, actively contributing to codebases and applying modern practices from the Thoughtworks Technology Radar. You will design pragmatic solutions that balance technical constraints, cost efficiency, performance, and system safety. Working closely with developers, data scientists, platform engineers, and product teams, you will help deliver production-ready AI capabilities that meet business needs and uphold a high bar for quality.
You will also play an active role in fostering a collaborative, inclusive team culture, encouraging feedback and supporting the growth of team members.
Responsibilities
- Design, implement, and maintain monitoring and alerting for ML and AI operational signals, including performance degradation across model types (e.g., computer vision, recommendation, GenAI), data drift, latency issues, and anomalies. This includes GenAI-specific signals such as prompt failures, hallucination trends, guardrail violations, and overall agent workflow health.
- Build and operate robust evaluation and testing pipelines for all ML and AI systems, including automated regression tests for model versions, prompts, workflows, and tools, ensuring new releases meet or exceed established baselines.
- Investigate and resolve production issues related to model behavior, including faults in ML models, tool-calling errors, vector search/RAG retrieval failures (for GenAI), data quality issues, and problems at integration points.
- Collaborate with infrastructure and platform teams to ensure stable, performant, and cost-efficient AI inference, including optimization of deployment strategies, resource usage, and runtime configurations.
- Manage the lifecycle of ML models, prompts, embeddings, vector indices, and associated components, including controlled rollouts, versioning strategies, and automated evaluation gates.
- Design and operate effective feedback loops that incorporate real user interactions, evaluation metrics, UAT findings, and domain expert reviews, enabling continuous improvement of all ML/AI systems, including agentic systems.
- Uphold governance, safety, and compliance standards, ensuring observability, auditability, privacy protection, and adherence to organizational guidelines for all ML/AI systems and data handling.
- Maintain clear, comprehensive documentation covering operational procedures, system behaviors, incident findings, performance benchmarks, and deployment practices.
- Communicate system health, risks, upcoming changes, and operational insights clearly to technical and non-technical audiences.
- Support the growth and development of junior team members through guidance, knowledge sharing, and constructive feedback.
Job Qualifications
Technical Skills
- Strong proficiency in Python (Pandas, NumPy, Scikit-learn) for scripting, analysis, and maintaining production models.
- Strong SQL skills for querying, data manipulation, and operational data checks.
- Experience building or maintaining GenAI / agentic solutions (e.g., RAG pipelines or agent workflows using LlamaIndex, CrewAI, or similar orchestration tooling).
- Solid understanding of classical ML algorithms, model evaluation, and challenges like drift and bias.
- Hands-on experience with model monitoring (data quality, prediction quality, latency) using Prometheus, Grafana, or cloud-native tools.
- Experience with Azure (Databricks, Azure Machine Learning, etc.) for deployment and resource management; familiarity with GCP/AWS is a plus.
- Familiarity with Agile methodologies (Scrum/Kanban).
- Nice to have: Experience with big data frameworks (Spark, Dask) for large-scale processing.
- Nice to have: Understanding of containerization/orchestration such as Docker and basic Kubernetes.
- Nice to have: Exposure to workflow/pipeline, ML lifecycle, or IaC tooling (Airflow, Kubeflow, MLflow, Terraform).
Professional Skills
- Ability to influence others and advocate for technical excellence while adapting to change when necessary.
- Strong analytical and troubleshooting skills.
- Excellent communication and articulation skills.
- Ability to navigate ambiguity and tackle challenges from multiple perspectives.
- Experience mentoring junior consultants.
- Willingness to be part of a 24x7 on-call rotation, as needed.