remote
Senior Site Reliability Engineer - AI Platform - Optum
Site Reliability Engineer
Senior SRE to design, build, and operate highly available AI platform services on AWS, leveraging Kubernetes, Terraform, and automation tools to ensure performance, reliability, and rapid delivery of machine‑learning workloads.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, fault‑tolerant infrastructure for AI/ML workloads on AWS.
- Develop and manage Kubernetes clusters, including networking, storage, and security configurations.
- Automate provisioning and configuration using Terraform and CI/CD pipelines (GitHub Actions, Jenkins, or similar).
- Implement observability solutions with Prometheus, Grafana, and CloudWatch to monitor performance, latency, and reliability.
- Collaborate with data scientists and software engineers to optimize resource usage and shorten the time it takes to deploy models.
- Participate in the on‑call rotation, incident response, and post‑mortem analysis to continuously improve system resilience.
Requirements
- 5+ years of hands‑on SRE or DevOps experience in cloud environments, preferably AWS.
- Strong proficiency in Python for automation and scripting.
- Deep experience with Kubernetes orchestration, Helm charts, and container lifecycle management.
- Proven expertise in infrastructure‑as‑code using Terraform or CloudFormation.
- Solid understanding of monitoring, alerting, and logging frameworks (Prometheus, Grafana, CloudWatch, ELK).
Skills
Python, Kubernetes, AWS, Terraform, Prometheus, CI/CD