Remote
Senior GCP SRE - ELK - EPAM Systems
Site Reliability Engineer
Lead the design, deployment, and operation of scalable GCP infrastructure, focusing on ELK stack observability and reliability. Drive automation, performance tuning, and incident response for a high‑availability B2B parts platform.
About the role
Key Responsibilities
- Architect, implement, and maintain GCP-based infrastructure for a high‑traffic B2B parts platform, ensuring 99.99% uptime.
- Design and operate ELK (Elasticsearch, Logstash, Kibana) clusters for real‑time log aggregation, monitoring, and alerting.
- Automate provisioning and configuration using Terraform, Cloud Deployment Manager, and CI/CD pipelines.
- Collaborate with development teams to embed SRE best practices, including error budgets, chaos engineering, and capacity planning.
- Lead incident investigations, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
Requirements
- 5+ years of experience in cloud operations, with a strong focus on GCP and SRE principles.
- Hands‑on expertise with the ELK stack and container orchestration on Kubernetes.
- Proficiency in IaC tools (Terraform, Cloud Deployment Manager) and CI/CD pipelines (GitLab CI, Cloud Build).
- Excellent problem‑solving skills, strong communication, and a proactive, collaborative mindset.
- Experience with monitoring, alerting, and incident management tools (Prometheus, Grafana, PagerDuty).
Skills
Kubernetes, Terraform