onsite

Senior Site Reliability Engineer, AI Infrastructure - Andromeda Cluster

Site Reliability Engineer

Senior Site Reliability Engineer focused on AI infrastructure, ensuring high availability, scalability, and performance of ML workloads using AWS, Kubernetes, Docker, and advanced CI/CD pipelines.

About the role

Key Responsibilities

Design, deploy, and maintain highly available AI/ML infrastructure on AWS, leveraging Kubernetes and Docker for container orchestration.
Implement and manage CI/CD pipelines for model training, testing, and production deployment using Terraform, GitHub Actions, and ArgoCD.
Monitor system health, performance, and security, proactively addressing incidents and optimizing resource utilization.
Collaborate with data scientists and ML engineers to integrate model serving, versioning, and monitoring into the production stack.
Automate infrastructure provisioning, scaling, and disaster recovery procedures to support rapid experimentation and deployment cycles.

Requirements

5+ years of experience in site reliability engineering or DevOps with a focus on AI/ML workloads.
Proficiency in Python, AWS services (EKS, S3, Lambda), Kubernetes, Docker, and Terraform.
Strong background in CI/CD tooling, monitoring (Prometheus, Grafana), and incident response.
Experience with ML model deployment, versioning, and monitoring frameworks (MLflow, Seldon).
Excellent problem-solving skills and a collaborative mindset.

Skills

Python, AWS, Kubernetes, Docker, CI/CD, Terraform

Company: Andromeda Cluster

Department: Engineering

Location: San Francisco, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 18, 2026