Remote
Lead AI Platform Engineer - Manulife
DevOps Engineer
The Lead AI Platform Engineer designs and scales cloud infrastructure for AI model training, deployment, and management using AWS, Kubernetes, Docker, and Terraform, and collaborates with data scientists to optimize performance and cost.
About the role
Key Responsibilities
- Design, build, and maintain scalable AI platform architectures on AWS, ensuring high availability and performance for model training and inference workloads.
- Implement containerization (Docker) and orchestration (Kubernetes) pipelines, integrating CI/CD workflows for rapid model deployment.
- Collaborate closely with data scientists and data engineers to translate model requirements into production-ready infrastructure, including data pipelines and monitoring.
- Monitor system performance, troubleshoot incidents, and continuously optimize resource utilization for cost‑effectiveness.
- Stay current with emerging AI and cloud technologies, evaluating and adopting new tools to enhance platform capabilities.
Requirements
- 5+ years of experience in cloud engineering, with deep expertise in AWS services (EKS, ECS, S3, SageMaker).
- Proficient in Python, Docker, Kubernetes, Terraform, and CI/CD tools (GitHub Actions, Jenkins).
- Strong background in machine learning operations (MLOps) and experience deploying ML models at scale.
- Excellent problem‑solving skills and ability to work cross‑functionally with data teams.
- Strong communication skills and a proactive, collaborative mindset.
Skills
Python, AWS, Kubernetes, Docker, Machine Learning, Terraform, CI/CD