remote

Staff Site Reliability Engineer - Observability GCP - Okta

Site Reliability Engineer

Lead the design and expansion of a robust observability platform on GCP, driving reliability, performance, and incident response for a high‑scale identity service using Prometheus, Grafana, and Kubernetes.

About the role

Key Responsibilities

Architect, implement, and maintain end‑to‑end observability solutions on Google Cloud Platform, including metrics, logs, and traces.
Integrate Prometheus, Grafana, and Cloud Monitoring to provide real‑time visibility across distributed services.
Collaborate with SRE, DevOps, and security teams to define SLIs, SLOs, and alerting strategies.
Automate observability tooling and dashboards using IaC and CI/CD pipelines.
Lead incident investigations, root‑cause analysis, and post‑mortem documentation to improve system resilience.

Requirements

5+ years of SRE or DevOps experience with a focus on observability.
Deep expertise in Google Cloud Platform services (Monitoring, Logging, Cloud Trace).
Hands‑on experience with Prometheus, Grafana, and Kubernetes monitoring.
Strong scripting skills (Python, Bash) and familiarity with IaC tools (Terraform, Deployment Manager).
Excellent communication and collaboration skills in a fast‑paced, cross‑functional environment.

Skills

Prometheus, Grafana, Kubernetes

Company: Okta

Department: Engineering

Location: Bellevue, WA, United States

Experience: 7+ years

Tenure: full-time

Level: Lead

Salary: 267,000

Posted June 19, 2026