remote

Site Reliability Engineer (SRE) - EPAM Systems

Senior Site Reliability Engineer driving SRE best practices, monitoring, and SLO/SLI definition for a high‑frequency trading platform, ensuring reliability, performance, and rapid deployment cycles in a cloud environment.

About the role

Key Responsibilities

Implement and champion DevOps and SRE best practices across the organization.
Lead technology roadmap discussions for the SRE team, aligning with product and engineering goals.
Define and maintain SLIs and SLOs, and track key metrics such as MTTR, Lead Time for Changes, Deployment Frequency, and Change Failure Rate.
Design, develop, and manage monitoring, alerting, and incident response workflows to ensure platform stability.
Collaborate with development teams to embed reliability into feature development and release pipelines.

Requirements

5+ years of experience in Site Reliability Engineering or DevOps roles.
Proficiency with cloud platforms (AWS preferred) and container orchestration (Kubernetes).
Hands‑on experience with monitoring tools (Prometheus, Grafana, Datadog) and alerting systems.
Strong scripting skills (Python, Bash) and CI/CD pipeline expertise.
Excellent communication and collaboration skills in a fast‑paced, high‑availability environment.

Skills

CI/CD, AWS

Company: EPAM Systems

Department: Engineering

Location: Canada

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026