Remote
Site Reliability Engineer - Optum
Drive the design, deployment, and operation of scalable cloud infrastructure for Optum Serve, leveraging Kubernetes, Docker, AWS, and Terraform to ensure high availability, performance, and security across commercial and government workloads.
About the role
Key Responsibilities
- Architect, implement, and maintain highly available Kubernetes clusters on AWS, ensuring seamless deployment of containerized services.
- Develop and manage Terraform modules for infrastructure as code, automating provisioning and lifecycle management.
- Implement robust monitoring, logging, and alerting solutions using Prometheus, Grafana, and CloudWatch to detect and remediate incidents proactively.
- Collaborate with development teams to integrate CI/CD pipelines, enforce best practices, and streamline release processes.
- Conduct capacity planning, performance tuning, and cost optimization across cloud resources.
- Respond to on‑call incidents, perform root cause analysis, and drive post‑mortem improvements.
Requirements
- 3+ years of experience in site reliability engineering or DevOps roles.
- Hands‑on expertise with Kubernetes, Docker, and AWS services (EKS, EC2, S3, RDS).
- Proficiency in Terraform and configuration management tools.
- Strong scripting skills in Bash or Python for automation.
- Excellent problem‑solving abilities and a collaborative mindset.
Skills
Kubernetes, Docker, AWS, Terraform