Remote
Staff Site Reliability Engineer - Observability - Okta
Site Reliability Engineer
Lead the design, implementation, and scaling of observability platforms for a global identity cloud, using Prometheus, Grafana, Kubernetes, Go, and AWS to ensure high reliability and performance.
About the role
Key Responsibilities
- Architect and operate end‑to‑end observability solutions (metrics, tracing, logging) for a large‑scale identity platform.
- Develop and maintain reusable instrumentation libraries in Go and Python, integrating with Prometheus, Grafana, and OpenTelemetry.
- Collaborate with product and engineering teams to define SLIs/SLOs, create alerting policies, and drive incident response automation.
- Manage Kubernetes clusters and related infrastructure on AWS, ensuring observability tooling is highly available and cost‑effective.
- Mentor junior SREs, promote best practices, and contribute to continuous improvement of reliability processes.
Requirements
- 7+ years of SRE or DevOps experience with large‑scale, cloud‑native systems.
- Deep expertise in observability stacks (Prometheus, Grafana, OpenTelemetry) and performance tuning.
- Strong programming skills in Go (or Python) and experience with infrastructure as code (Terraform, CloudFormation).
- Hands‑on experience managing Kubernetes workloads on AWS, including networking, security, and scaling.
- Proven track record of defining and meeting reliability targets (SLIs/SLOs) in production environments.
Skills
Prometheus, Grafana, Kubernetes, Go, AWS