remote
Staff Site Reliability Operations Engineer - Calix
Systems Engineer
Lead the design, implementation, and operation of scalable, highly available cloud services on AWS using Kubernetes, Terraform, and Python, driving automation, reliability, and performance for a cloud‑first, AI‑powered platform.
About the role
Key Responsibilities
- Architect and maintain highly available, scalable Kubernetes clusters on AWS, targeting zero downtime and efficient resource utilization.
- Develop and manage IaC pipelines with Terraform, automating infrastructure provisioning and configuration across multiple environments.
- Implement robust CI/CD workflows, integrating automated testing, security scanning, and blue‑green deployments for rapid, reliable releases.
- Design and maintain the observability stack (Prometheus, Grafana, Loki, etc.) to provide real‑time metrics, logs, and alerts that drive proactive incident response.
- Collaborate with development, security, and product teams to define SLOs, SLIs, and incident management processes, fostering a culture of reliability.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles, with deep expertise in Kubernetes and AWS.
- Proficiency in Terraform, Python, and CI/CD tooling (GitHub Actions, Argo CD, Jenkins).
- Strong background in monitoring, logging, and alerting solutions (Prometheus, Grafana, Loki, ELK).
- Excellent problem‑solving skills and a proactive approach to automation and process improvement.
- Effective communication skills and ability to work cross‑functionally in a fast‑paced environment.
Skills
Kubernetes, AWS, Terraform, Python, CI/CD