Remote
Staff Site Reliability Engineer - Observability GCP - Okta
Site Reliability Engineer
Lead the design and expansion of an observability platform on GCP, built on Prometheus, Grafana, and Kubernetes, to drive reliability, performance, and incident response for a high‑scale identity service.
About the role
Key Responsibilities
- Architect, implement, and maintain end‑to‑end observability solutions on Google Cloud Platform, including metrics, logs, and traces.
- Integrate Prometheus, Grafana, and Cloud Monitoring to provide real‑time visibility across distributed services.
- Collaborate with SRE, DevOps, and security teams to define SLIs, SLOs, and alerting strategies.
- Automate observability tooling and dashboards using IaC and CI/CD pipelines.
- Lead incident investigations, root‑cause analysis, and post‑mortem documentation to improve system resilience.
Requirements
- 5+ years of SRE or DevOps experience with a focus on observability.
- Deep expertise in Google Cloud Platform observability services (Cloud Monitoring, Cloud Logging, Cloud Trace).
- Hands‑on experience with Prometheus, Grafana, and Kubernetes monitoring.
- Strong scripting skills (Python, Bash) and familiarity with IaC tools (Terraform, Deployment Manager).
- Excellent communication and collaboration skills in a fast‑paced, cross‑functional environment.
Skills
Prometheus, Grafana, Kubernetes