remote
Site Reliability Engineer (SRE) - EPAM Systems
Senior Site Reliability Engineer responsible for driving SRE best practices, monitoring, and SLI/SLO definition for a high‑frequency trading platform, with a focus on reliability, performance, and rapid deployment cycles in a cloud environment.
About the role
Key Responsibilities
- Implement and champion DevOps and SRE best practices across the organization.
- Lead technology roadmap discussions for the SRE team, aligning with product and engineering goals.
- Define and maintain SLIs and SLOs, and track key delivery and recovery metrics such as MTTR, Lead Time for Changes, Deployment Frequency, and Change Failure Rate.
- Design, develop, and manage monitoring, alerting, and incident response workflows to ensure platform stability.
- Collaborate with development teams to embed reliability into feature development and release pipelines.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Proficiency with cloud platforms (AWS preferred) and container orchestration (Kubernetes).
- Hands‑on experience with monitoring tools (Prometheus, Grafana, Datadog) and alerting systems.
- Strong scripting skills (Python, Bash) and CI/CD pipeline expertise.
- Excellent communication and collaboration skills in a fast‑paced, high‑availability environment.