Remote
Staff Site Reliability Engineer - Infra - Okta
Site Reliability Engineer
Lead the design, build, and operation of highly scalable, secure infrastructure on AWS and GCP, driving reliability, automation, and incident response for production systems.
About the role
Key Responsibilities
- Design, build, and operate highly scalable, reliable, and secure infrastructure powering production systems across AWS and GCP.
- Lead major reliability initiatives, including capacity planning, performance tuning, and cost optimization.
- Implement and maintain CI/CD pipelines, infrastructure as code (Terraform), and automated testing to accelerate delivery.
- Develop and enforce monitoring, alerting, and incident response processes that sustain 99.99% uptime.
- Collaborate with cross‑functional teams to define SLAs, SLOs, and error budgets.
Requirements
- 5+ years of SRE or DevOps experience in large, distributed environments.
- Deep expertise with AWS and GCP services, Kubernetes, and Terraform.
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Proven track record of building observability solutions (Prometheus, Grafana, ELK) and managing incidents.
- Excellent communication, problem‑solving, and collaboration skills.
Skills
AWS, GCP, Kubernetes, Terraform, CI/CD