onsite
ML Platform Engineer - Stitch Fix
DevOps Engineer
Build and scale machine‑learning infrastructure on cloud platforms so that data scientists can deploy models efficiently. The role centers on Python, TensorFlow, Kubernetes, and AWS, with robust CI/CD pipelines and big‑data processing with Spark.
About the role
Key Responsibilities
- Design, develop, and maintain a scalable ML platform that supports end‑to‑end model training, serving, and monitoring.
- Implement infrastructure as code using Kubernetes and AWS services to ensure high availability and cost‑effective scaling.
- Build CI/CD pipelines for automated testing, containerization, and deployment of ML workloads.
- Collaborate with data scientists to integrate frameworks such as TensorFlow and PyTorch into production pipelines.
- Optimize data processing workflows with Spark and manage data pipelines for feature engineering.
- Monitor system performance, troubleshoot issues, and continuously improve platform reliability and security.
Requirements
- 5+ years of experience in software engineering or ML platform development.
- Strong proficiency in Python and experience with deep‑learning libraries (TensorFlow, PyTorch).
- Hands‑on experience with Kubernetes, Docker, and AWS (EKS, S3, SageMaker, etc.).
- Proven ability to build CI/CD pipelines and automate deployments.
- Experience processing large datasets using Spark or similar big‑data technologies.
Skills
Python, TensorFlow, Kubernetes, AWS, CI/CD