Remote

Principal Site Reliability Engineer - Optum

Site Reliability Engineer

Lead the design and scaling of reliability practices for large‑scale cloud platforms, driving automation, monitoring, and incident response to ensure high availability and performance.

About the role

Key Responsibilities

Architect and implement SRE frameworks across multi‑cloud environments, ensuring reliability, scalability, and cost efficiency.
Design and maintain automated monitoring, alerting, and incident response pipelines using modern observability tools.
Lead root‑cause analysis and post‑mortem processes, translating findings into actionable improvements.
Collaborate with development, security, and product teams to embed reliability into the CI/CD pipeline.
Mentor and coach engineering teams on SRE best practices, tooling, and culture.

Requirements

10+ years of experience in large‑scale cloud operations, with deep expertise in Kubernetes and container orchestration.
Proven track record of building and scaling monitoring, alerting, and incident response systems.
Strong scripting and automation skills (Python, Bash, Terraform).
Experience with CI/CD pipelines, GitOps, and cloud cost optimization.
Excellent communication skills and a collaborative mindset.

Skills

Kubernetes, CI/CD

Company: Optum

Department: Engineering

Location: Minnetonka, MN, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 19, 2026