remote

Lead Site Reliability Engineer - Canada - Vista

Site Reliability Engineer

Lead the Incident Response team to drive reliability across a cloud‑native platform, using Kubernetes, AWS, Terraform, and Python to analyze failures, automate remediation, and embed observability best practices.

About the role

Key Responsibilities

Lead the Incident Response function, coordinating cross‑functional response to high‑severity outages and ensuring timely resolution.
Analyze incident data and post‑mortems to identify systemic failure patterns and root causes.
Design and implement automated remediation and reliability tooling using Python, Terraform, and cloud‑native services.
Partner with engineering squads to embed observability, alerting, and SLO/SLI frameworks into their services.
Mentor SRE team members, raise engineering standards, and drive continuous improvement of incident processes.

Requirements

5+ years of experience in Site Reliability Engineering or Incident Management on large‑scale, cloud‑native platforms.
Deep expertise with Kubernetes orchestration and AWS infrastructure services.
Proficiency in infrastructure‑as‑code (Terraform) and scripting/automation (Python).
Strong analytical skills to translate incident data into actionable reliability improvements.
Excellent communication and leadership abilities to influence technical teams and stakeholders.

Skills

Kubernetes, AWS, Terraform, Python

Company: Vista

Department: Engineering

Location: Canada

Experience: 7+ years

Tenure: full-time

Level: Lead

Salary: 143,000

Posted June 20, 2026