onsite

Staff Site Reliability Engineer AIOps - Palo Alto Networks

Site Reliability Engineer

Lead the design, implementation, and operation of AI‑driven reliability platforms, leveraging Python, Go, Kubernetes, and cloud services to deliver proactive monitoring and automated remediation at scale.

About the role

Key Responsibilities

Architect and build highly available, automated SRE platforms that incorporate AIOps techniques for predictive incident detection and remediation.
Develop and maintain observability pipelines using Prometheus, Grafana, and custom telemetry collectors written in Python or Go.
Design, implement, and manage infrastructure-as-code solutions with Terraform on AWS, ensuring repeatable, secure, and compliant deployments.
Collaborate with product and security engineering teams to embed reliability best practices into the software development lifecycle.
Lead incident response, perform root‑cause analysis, and drive continuous improvement through post‑mortem reviews and automation.

Requirements

5+ years of experience in site reliability engineering or related roles, with a strong focus on automation and observability.
Proficiency in programming/scripting languages such as Python and Go.
Deep hands‑on experience with Kubernetes orchestration, container runtimes, and cloud platforms (AWS preferred).
Expertise in monitoring stacks (Prometheus, Grafana) and infrastructure‑as‑code tools like Terraform.
Demonstrated ability to apply machine‑learning or AIOps concepts to improve system reliability and reduce mean time to resolution.

Skills

Python, Go, Kubernetes, Prometheus, Terraform, AWS

Company: Palo Alto Networks

Department: Engineering

Location: Santa Clara, CA, United States

Experience: 7+ years

Tenure: full-time

Level: Lead

Salary: 169,225

Posted June 19, 2026