Remote
SRE Operations Engineer - Okta
Site Reliability Engineer
The SRE Operations Engineer drives reliability for AI‑centric identity services, using Kubernetes, AWS, Terraform, and advanced monitoring to maintain 99.99% uptime and resolve incidents rapidly.
About the role
Key Responsibilities
- Design, implement, and maintain highly available Kubernetes clusters for identity services.
- Automate infrastructure provisioning and configuration using Terraform and AWS services.
- Develop and refine monitoring, alerting, and incident response workflows to meet stringent uptime SLAs.
- Collaborate with development teams to embed reliability best practices into CI/CD pipelines.
- Lead post‑mortem analyses, root cause investigations, and continuous improvement initiatives.
Requirements
- 5+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with AWS, Terraform, and container orchestration with Kubernetes.
- Strong scripting skills (Python, Bash) and experience with monitoring tools (Prometheus, Grafana, Datadog).
- Hands‑on incident response experience and familiarity with a post‑mortem culture.
- Excellent communication and collaboration skills across distributed teams.
Skills
Kubernetes, AWS, Terraform