remote

Senior Site Reliability Engineer - AI Platform - Optum

Site Reliability Engineer

Seeking a Senior SRE to design, build, and operate highly available AI platform services on AWS, using Kubernetes, Terraform, and automation tools to ensure performance, reliability, and rapid delivery of machine‑learning workloads.

About the role

Key Responsibilities

Design, implement, and maintain scalable, fault‑tolerant infrastructure for AI/ML workloads on AWS.
Develop and manage Kubernetes clusters, including networking, storage, and security configurations.
Automate provisioning and configuration using Terraform and CI/CD pipelines (GitHub Actions, Jenkins, or similar).
Implement observability solutions with Prometheus, Grafana, and CloudWatch to monitor performance, latency, and reliability.
Collaborate with data scientists and software engineers to optimize resource usage and shorten model deployment time.
Participate in on‑call rotation, incident response, and post‑mortem analysis to continuously improve system resilience.

Requirements

5+ years of hands‑on SRE or DevOps experience in cloud environments, preferably AWS.
Strong proficiency in Python for automation and scripting.
Deep experience with Kubernetes orchestration, Helm charts, and container lifecycle management.
Proven expertise in infrastructure‑as‑code using Terraform or CloudFormation.
Solid understanding of monitoring, alerting, and logging frameworks (Prometheus, Grafana, CloudWatch, ELK).

Skills

Python, Kubernetes, AWS, Terraform, Prometheus, CI/CD

Company: Optum

Department: Engineering

Location: Karnataka, India

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 25, 2026