Onsite
Staff Site Reliability Engineer, AIOps - Palo Alto Networks
Site Reliability Engineer
Lead the design and operation of highly available, AI‑driven infrastructure, driving automation, observability, and resilience across cloud platforms using Kubernetes, CI/CD pipelines, and Python scripting.
About the role
Key Responsibilities
- Architect, deploy, and maintain large‑scale, AI‑enhanced infrastructure on AWS/GCP, ensuring 99.99% uptime and rapid incident response.
- Develop and extend Kubernetes operators and Helm charts to automate application lifecycle and scaling.
- Implement end‑to‑end observability with Prometheus, Grafana, and custom AIOps dashboards, integrating ML models for anomaly detection.
- Lead incident management, root‑cause analysis, and post‑mortem processes, driving continuous improvement.
- Collaborate with DevOps, security, and product teams to embed reliability best practices into CI/CD pipelines.
Requirements
- 10+ years of SRE or DevOps experience, including 5+ years in AI/ML operations.
- Deep expertise in Kubernetes, Helm, and container orchestration at scale.
- Proficient in Python, Bash, and infrastructure-as-code tools (Terraform, Pulumi).
- Strong background in cloud services (AWS, GCP) and CI/CD tooling (GitHub Actions, ArgoCD).
- Excellent communication skills and a proven track record of driving reliability and automation initiatives.
Skills
Kubernetes, CI/CD, Python