Remote
Principal Site Reliability Engineer - Optum
Site Reliability Engineer
Lead the design and scaling of reliability practices for large‑scale cloud platforms, driving automation, monitoring, and incident response to ensure high availability and performance.
About the role
Key Responsibilities
- Architect and implement SRE frameworks across multi‑cloud environments, ensuring reliability, scalability, and cost efficiency.
- Design and maintain automated monitoring, alerting, and incident response pipelines using modern observability tools.
- Lead root‑cause analysis and post‑mortem processes, translating findings into actionable improvements.
- Collaborate with development, security, and product teams to embed reliability into the CI/CD pipeline.
- Mentor and coach engineering teams on SRE best practices, tooling, and culture.
Requirements
- 10+ years of experience in large‑scale cloud operations, with deep expertise in Kubernetes and container orchestration.
- Proven track record of building and scaling monitoring, alerting, and incident response systems.
- Strong scripting and automation skills (Python, Bash, Terraform).
- Experience with CI/CD pipelines, GitOps, and cloud cost optimization.
- Excellent communication skills and a collaborative mindset.