onsite
Senior Site Reliability Engineer, AI Infrastructure - Andromeda Cluster
Site Reliability Engineer
Senior Site Reliability Engineer role focused on AI infrastructure: ensure the high availability, scalability, and performance of ML workloads using AWS, Kubernetes, Docker, and CI/CD pipelines.
About the role
Key Responsibilities
- Design, deploy, and maintain highly available AI/ML infrastructure on AWS, leveraging Kubernetes and Docker for container orchestration.
- Implement and manage CI/CD pipelines for model training, testing, and production deployment using Terraform, GitHub Actions, and Argo CD.
- Monitor system health, performance, and security, proactively addressing incidents and optimizing resource utilization.
- Collaborate with data scientists and ML engineers to integrate model serving, versioning, and monitoring into the production stack.
- Automate infrastructure provisioning, scaling, and disaster recovery procedures to support rapid experimentation and deployment cycles.
Requirements
- 5+ years of experience in site reliability engineering or DevOps with a focus on AI/ML workloads.
- Proficiency in Python, AWS services (EKS, S3, Lambda), Kubernetes, Docker, and Terraform.
- Strong background in CI/CD tooling, monitoring (Prometheus, Grafana), and incident response.
- Experience with ML model deployment, versioning, and monitoring frameworks (MLflow, Seldon).
- Excellent problem‑solving skills and a collaborative mindset.
Skills
Python, AWS, Kubernetes, Docker, CI/CD, Terraform