Remote

Staff Site Reliability Engineer - Observability - Okta

Site Reliability Engineer

Lead the design, implementation, and scaling of observability platforms for a global identity cloud, using Prometheus, Grafana, Kubernetes, Go, and AWS to ensure high reliability and performance.

About the role

Key Responsibilities

Architect and operate end‑to‑end observability solutions (metrics, tracing, logging) for a large‑scale identity platform.
Develop and maintain reusable instrumentation libraries in Go and Python, integrating with Prometheus, Grafana, and OpenTelemetry.
Collaborate with product and engineering teams to define SLIs/SLOs, create alerting policies, and drive incident response automation.
Manage Kubernetes clusters and related infrastructure on AWS, ensuring observability tooling is highly available and cost‑effective.
Mentor junior SREs, promote best practices, and contribute to continuous improvement of reliability processes.

Requirements

7+ years of SRE or DevOps experience with large‑scale, cloud‑native systems.
Deep expertise in observability stacks (Prometheus, Grafana, OpenTelemetry) and performance tuning.
Strong programming skills in Go (or Python) and experience with infrastructure as code (Terraform, CloudFormation).
Hands‑on experience managing Kubernetes workloads on AWS, including networking, security, and scaling.
Proven track record of defining and meeting reliability targets (SLIs/SLOs) in production environments.

Skills

Prometheus, Grafana, Kubernetes, Go, AWS

Company: Okta

Department: Engineering

Location: Karnataka, India

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 25, 2026