Remote

Senior GCP SRE - ELK - EPAM Systems

Site Reliability Engineer

Lead the design, deployment, and operation of scalable GCP infrastructure, focusing on ELK stack observability and reliability. Drive automation, performance tuning, and incident response for a high‑availability B2B parts platform.

About the role

Key Responsibilities

Architect, implement, and maintain GCP-based infrastructure for a high‑traffic B2B parts platform, ensuring 99.99% uptime.
Design and operate ELK (Elasticsearch, Logstash, Kibana) clusters for real‑time log aggregation, monitoring, and alerting.
Automate provisioning and configuration using Terraform, Cloud Deployment Manager, and CI/CD pipelines.
Collaborate with development teams to embed SRE best practices, including error budgets, chaos engineering, and capacity planning.
Lead incident investigations, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.

Requirements

5+ years of experience in cloud operations, with a strong focus on GCP and SRE principles.
Hands‑on expertise with the ELK stack, Kubernetes, and container orchestration.
Proficiency in IaC tools (Terraform, Cloud Deployment Manager) and CI/CD pipelines (GitLab CI, Cloud Build).
Excellent problem‑solving skills, strong communication, and a proactive, collaborative mindset.
Experience with monitoring, alerting, and incident management tools (Prometheus, Grafana, PagerDuty).

Skills

Kubernetes, Terraform

Company: EPAM Systems

Department: Engineering

Location: Telangana, India

Experience: 9+ years

Tenure: Full-time

Level: Senior

Posted June 24, 2026