remote
Sr Site Reliability Engineer - Cloud, AI Operations, SaaS, PaaS - Optum
Site Reliability Engineer
Senior Site Reliability Engineer focused on cloud‑native AI operations for SaaS and PaaS platforms, driving reliability, automation, and performance at scale.
About the role
Key Responsibilities
- Design, implement, and maintain highly available cloud infrastructure for AI‑driven SaaS and PaaS services.
- Develop and manage CI/CD pipelines, ensuring rapid, reliable deployments across multi‑cloud environments.
- Implement observability, monitoring, and alerting solutions to detect and remediate incidents proactively.
- Collaborate with development, security, and product teams to embed SRE best practices into the software lifecycle.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve system resilience.
Requirements
- 5+ years of experience in site reliability engineering or related roles.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS, Azure, or GCP).
- Strong scripting skills in Python and experience with CI/CD tools (GitHub Actions, Jenkins, Argo CD).
- Hands‑on experience with monitoring/observability stacks (Prometheus, Grafana, ELK, or similar).
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
Kubernetes, CI/CD, Python