onsite

Staff Site Reliability Engineer AIOps - Palo Alto Networks

Site Reliability Engineer

Lead the design and operation of highly available, AI‑driven infrastructure, driving automation, observability, and resilience across cloud platforms using Kubernetes, CI/CD pipelines, and Python scripting.

About the role

Key Responsibilities

Architect, deploy, and maintain large‑scale, AI‑enhanced infrastructure on AWS/GCP, ensuring 99.99% uptime and rapid incident response.
Develop and extend Kubernetes operators and Helm charts to automate application lifecycle and scaling.
Implement end‑to‑end observability with Prometheus, Grafana, and custom AIOps dashboards, integrating ML models for anomaly detection.
Lead incident management, root‑cause analysis, and post‑mortem processes, driving continuous improvement.
Collaborate with DevOps, security, and product teams to embed reliability best practices into CI/CD pipelines.

Requirements

10+ years of SRE or DevOps experience, with 5+ in AI/ML operations.
Deep expertise in Kubernetes, Helm, and container orchestration at scale.
Proficient in Python, Bash, and infrastructure-as-code tools (Terraform, Pulumi).
Strong background in cloud services (AWS, GCP) and CI/CD tooling (GitHub Actions, ArgoCD).
Excellent communication skills and a proven track record of driving reliability and automation initiatives.

Skills

Kubernetes, CI/CD, Python

Company: Palo Alto Networks

Department: Engineering

Location: California, United States

Experience: 7+ years

Tenure: full-time

Level: Lead

Posted June 23, 2026