Remote
Senior Machine Learning Engineer, Reliability - Roblox
ML Engineer
Senior Machine Learning Engineer focused on building reliable, scalable ML systems for a global platform, leveraging Python, AWS, and distributed data pipelines to ensure high availability and performance.
About the role
Key Responsibilities
- Design, develop, and maintain end‑to‑end machine learning pipelines that run at scale on AWS infrastructure.
- Collaborate with reliability and infrastructure teams to implement robust monitoring, alerting, and automated rollback mechanisms for ML models.
- Optimize model training and inference workloads for latency, throughput, and cost efficiency across distributed clusters.
- Conduct rigorous A/B testing, performance benchmarking, and root‑cause analysis to continuously improve model quality and system resilience.
- Mentor junior engineers and contribute to best‑practice documentation for ML operations.
Requirements
- 5+ years of experience building production ML systems in a large‑scale environment.
- Proficiency in Python and PyTorch/TensorFlow, plus experience with distributed training frameworks.
- Deep knowledge of AWS services (SageMaker, ECS/EKS, Lambda, S3, CloudWatch) and CI/CD pipelines.
- Strong background in data engineering and SQL, including experience working with large datasets.
- Excellent problem‑solving skills and a passion for building reliable, maintainable systems.
Skills
Python, machine learning, AWS