Remote
Lead Site Reliability Engineer - Canada - Vista
Site Reliability Engineer
Lead the Incident Response team to drive reliability across a cloud‑native platform, using Kubernetes, AWS, Terraform, and Python to analyze failures, automate remediation, and embed observability best practices.
About the role
Key Responsibilities
- Lead the Incident Response function, coordinating cross‑functional response to high‑severity outages and ensuring timely resolution.
- Analyze incident data and post‑mortems to identify systemic failure patterns and root causes.
- Design and implement automated remediation and reliability tooling using Python, Terraform, and cloud‑native services.
- Partner with engineering squads to embed observability, alerting, and SLO/SLI frameworks into their services.
- Mentor SRE team members, raise engineering standards, and drive continuous improvement of incident processes.
Requirements
- 5+ years of experience in Site Reliability Engineering or Incident Management on large‑scale, cloud‑native platforms.
- Deep expertise with Kubernetes orchestration and AWS infrastructure services.
- Proficiency in infrastructure‑as‑code (Terraform) and scripting/automation (Python).
- Strong analytical skills to translate incident data into actionable reliability improvements.
- Excellent communication and leadership abilities to influence technical teams and stakeholders.
Skills
Kubernetes, AWS, Terraform, Python