onsite
Staff Site Reliability Engineer, AIOps - Palo Alto Networks
Site Reliability Engineer
Lead the design, implementation, and operation of AI‑driven reliability platforms, leveraging Python, Go, Kubernetes, and cloud services to deliver proactive monitoring and automated remediation at scale.
About the role
Key Responsibilities
- Architect and build highly available, automated SRE platforms that incorporate AIOps techniques for predictive incident detection and remediation.
- Develop and maintain observability pipelines using Prometheus, Grafana, and custom telemetry collectors written in Python or Go.
- Design, implement, and manage infrastructure-as-code solutions with Terraform on AWS, ensuring repeatable, secure, and compliant deployments.
- Collaborate with product and security engineering teams to embed reliability best practices into the software development lifecycle.
- Lead incident response, perform root‑cause analysis, and drive continuous improvement through post‑mortem reviews and automation.
Requirements
- 5+ years of experience in site reliability engineering or related roles, with a strong focus on automation and observability.
- Proficiency in programming/scripting languages such as Python and Go.
- Deep hands‑on experience with Kubernetes orchestration, container runtimes, and cloud platforms (AWS preferred).
- Expertise in monitoring stacks (Prometheus, Grafana) and infrastructure‑as‑code tools like Terraform.
- Demonstrated ability to apply machine‑learning or AIOps concepts to improve system reliability and reduce mean time to resolution.
Skills
Python, Go, Kubernetes, Prometheus, Terraform, AWS