Remote
Senior Site Reliability Engineer - Okta
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability for Auth0 services, leveraging Kubernetes, AWS, Terraform, Go, and Python to build resilient, automated infrastructure and monitoring pipelines.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure for Auth0 services on Kubernetes and AWS.
- Develop and manage Terraform modules and CI/CD pipelines to automate deployments and rollbacks.
- Collaborate with Product and Quality Engineers to define reliability SLAs, SLOs, and error budgets.
- Implement observability solutions using Prometheus, Grafana, and distributed tracing to detect incidents quickly and support their remediation.
- Lead post‑mortem analyses, root cause investigations, and continuous improvement initiatives.
Requirements
- 5+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with Kubernetes, AWS services (EKS, EC2, S3, CloudWatch), and Terraform.
- Strong scripting skills in Go or Python and experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Deep understanding of monitoring, alerting, and incident response best practices.
- Excellent communication skills and a collaborative mindset.
Skills
Kubernetes, AWS, Terraform, Go, Python, CI/CD