hybrid
Sr. MLOps Engineer
Betterdata is seeking a Senior MLOps Engineer to transform cutting-edge research into production-ready services for synthetic data generation and optimize ML algorithms at enterprise scale. The role involves building and tuning end-to-end model pipelines, ensuring high performance, scalability, and reliability across diverse workloads and dataset sizes, with a focus on algorithm optimization, data handling at scale, and end-to-end orchestration.
About the role
Who We Are Looking For
We seek an experienced Senior Machine Learning Engineer to transform cutting-edge research into robust, production-ready services for synthetic data generation and to optimize both deep learning and classical ML algorithms (e.g. tree-based models) at enterprise scale (billions of rows). You will build and tune model pipelines end-to-end, ensuring high performance, scalability, and reliability across diverse workloads and dataset sizes.
Key Responsibilities
Algorithm Optimization & Scaling
- Identify and remove bottlenecks in deep generative models (e.g. transformers, diffusion models, GANs) to accelerate training and generation.
- Implement distributed training of these models across multi-GPU clusters.
- Optimize distributed training of traditional ML models (e.g. XGBoost, LightGBM, CatBoost) on billion-row datasets.
- Design best practices for memory management to maximize resource utilization (compute and memory), enabling faster training at lower cost.
Data Handling at Scale
- Collaborate with data engineers to design ETL/ELT workflows handling terabyte- to petabyte-scale tabular and unstructured data.
- Implement scalable feature engineering pipelines using distributed computing frameworks (e.g. Spark, Dask, or Ray).
- Automate data validation (e.g. schema checks, anomaly detection) with rule-based and ML-driven frameworks.
End-to-End Orchestration
- Build ML pipelines that transition research prototypes into reliable, production-grade workflows.
- Package models into Docker containers and deploy using Kubernetes.
- Build automated model and data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle.
- Design robust error-handling mechanisms, with automatic retries and data recovery in case of pipeline failures.
- Implement logging, monitoring and alerting systems.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, Software Engineering, Data Science or a related quantitative discipline.
- 5+ years of hands-on experience optimizing and scaling machine learning models in production environments.
- Demonstrated track record of accelerating model training workflows (e.g., transformers, diffusion models, GANs) at multi-GPU scale.
- Experience operating ETL/ELT pipelines handling terabytes to petabytes of tabular and unstructured data using distributed computing tools (e.g. Apache Spark, Dask, Ray).
- Demonstrated ability to translate research prototypes into reliable, production-grade ML pipelines with rigorous testing and validation.
- Experience with ML orchestration tools (e.g. Airflow, Dagster).
Good to Have
- Experience hosting models on scalable cloud infrastructure (AWS / Azure / GCP).
- Experience containerising data pipelines and AI models with Docker and supporting orchestration tools (e.g. Kubernetes).