Remote

Principal Site Reliability Engineer - Okta

Site Reliability Engineer

Lead the design, implementation, and operation of highly available, secure identity infrastructure using Kubernetes, AWS, and Terraform, driving automation, observability, and resilience for a global identity platform.

About the role

Key Responsibilities

Architect and maintain large‑scale, highly available Kubernetes clusters that support millions of identity transactions worldwide.
Design and implement end‑to‑end CI/CD pipelines using Go, Python, and Terraform to automate deployments, scaling, and configuration management.
Develop and maintain observability solutions (metrics, logs, and traces) using Prometheus, Grafana, and OpenTelemetry to ensure rapid incident detection and resolution.
Collaborate with security, product, and development teams to embed SRE best practices, including chaos engineering, capacity planning, and post‑mortem analysis.
Lead incident response, root cause analysis, and continuous improvement initiatives to reduce MTTR and increase system reliability.

Requirements

10+ years of experience in site reliability engineering or a related field, with a proven track record of managing mission‑critical services.
Deep expertise in Kubernetes, AWS, and Terraform for infrastructure as code.
Strong programming skills in Go and Python, with experience building automation tools and microservices.
Hands‑on experience with CI/CD, monitoring, and alerting platforms (Prometheus, Grafana, OpenTelemetry).
Excellent communication skills and a collaborative mindset to work across distributed teams.

Skills

Kubernetes, AWS, Terraform, Go, Python, CI/CD

Company: Okta

Department: Engineering

Location: Karnataka, India

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 20, 2026