Onsite
Staff Site Reliability Engineer, Cloud Management - Palo Alto Networks
Site Reliability Engineer
Lead the design, deployment, and operation of scalable cloud infrastructure, ensuring high availability, performance, and security for enterprise‑grade services using Kubernetes, AWS, and Terraform.
About the role
Key Responsibilities
- Architect and maintain highly available, scalable cloud environments for mission‑critical applications.
- Implement and manage Kubernetes clusters, CI/CD pipelines, and infrastructure as code with Terraform.
- Design and enforce observability, monitoring, and alerting strategies to detect and resolve incidents quickly.
- Collaborate with development, security, and product teams to embed reliability best practices into the software delivery lifecycle.
- Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives.
Requirements
- 10+ years of experience in site reliability engineering or related roles.
- Deep expertise in AWS, Kubernetes, and Terraform.
- Strong scripting skills (Python, Bash) and familiarity with CI/CD tools (GitHub Actions, Jenkins).
- Proven track record of building and operating large‑scale, highly available systems.
- Excellent communication skills and a collaborative mindset.
Skills
Kubernetes, AWS, Terraform