remote
Senior Site Reliability Engineer, Kubernetes with active TS/SCI - Okta
This Senior Site Reliability Engineer role focuses on Kubernetes infrastructure. You will drive reliability, performance, and security across cloud environments using CI/CD pipelines and monitoring tools.
About the role
Key Responsibilities
- Design, implement, and maintain highly available Kubernetes clusters across multi‑cloud environments (AWS, GCP, Azure).
- Develop and manage CI/CD pipelines to automate deployment, scaling, and rollbacks of containerized services.
- Implement monitoring, logging, and alerting solutions (Prometheus, Grafana, ELK) to support a 99.99% uptime target.
- Collaborate with security teams to enforce compliance, vulnerability scanning, and secure configuration best practices.
- Lead incident response, root cause analysis, and post‑mortem documentation to continuously improve reliability.
Requirements
- 5+ years of SRE or DevOps experience with a strong focus on Kubernetes.
- Proficiency with at least one major cloud provider (AWS, GCP, or Azure) and experience with infrastructure as code (Terraform, CloudFormation).
- Hands‑on scripting in Python or Go for automation and tooling.
- Deep understanding of monitoring, alerting, and observability principles.
- Excellent communication skills and a proactive, problem‑solving mindset.