Remote
Principal Site Reliability Engineer - Okta
Site Reliability Engineer
Lead the design, implementation, and operation of highly available, secure identity infrastructure using Kubernetes, AWS, and Terraform, driving automation, observability, and resilience for a global identity platform.
About the role
Key Responsibilities
- Architect and maintain large‑scale, highly available Kubernetes clusters that support millions of identity transactions worldwide.
- Design and implement end‑to‑end CI/CD pipelines using Go, Python, and Terraform to automate deployments, scaling, and configuration management.
- Develop and maintain observability solutions (metrics, logs, and traces) using Prometheus, Grafana, and OpenTelemetry to ensure rapid incident detection and resolution.
- Collaborate with security, product, and development teams to embed SRE best practices, including chaos engineering, capacity planning, and post‑mortem analysis.
- Lead incident response, root cause analysis, and continuous improvement initiatives to reduce MTTR and increase system reliability.
Requirements
- 10+ years of experience in site reliability engineering or a related field, with a proven track record of managing mission‑critical services.
- Deep expertise in Kubernetes, AWS, and Terraform for infrastructure as code.
- Strong programming skills in Go and Python, with experience building automation tools and microservices.
- Hands‑on experience with CI/CD, monitoring, and alerting platforms (Prometheus, Grafana, OpenTelemetry).
- Excellent communication skills and a collaborative mindset to work across distributed teams.
Skills
Kubernetes, AWS, Terraform, Go, Python, CI/CD