Remote
Staff Site Reliability Engineer - Okta
Site Reliability Engineer
Lead the design, build, and operation of highly scalable, secure infrastructure on AWS and GCP, driving reliability, performance, and automation for production systems.
About the role
Key Responsibilities
- Design, build, and operate highly scalable, reliable, and secure infrastructure powering production systems across AWS and GCP.
- Lead major reliability initiatives, including capacity planning, performance tuning, and cost optimization.
- Implement and maintain CI/CD pipelines, automated testing, and deployment workflows to accelerate feature delivery.
- Develop and enforce observability practices—metrics, logging, tracing—to detect, diagnose, and resolve incidents quickly.
- Collaborate with cross‑functional teams to define SLAs, SLOs, and incident response procedures.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Deep expertise with AWS and GCP services (EC2, ECS, EKS, GKE, Cloud Run, Cloud Functions).
- Proficiency in Kubernetes, container orchestration, and infrastructure as code (Terraform, CloudFormation).
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Hands‑on experience with monitoring, alerting, and incident management tools (Prometheus, Grafana, PagerDuty, Datadog).
Skills
AWS, GCP, Kubernetes, CI/CD