Remote / Onsite
Head of Site Reliability Engineering
Software Engineer
Lead the Site Reliability Engineering team to design, build, and operate highly available, scalable infrastructure for a global mobile eSports platform, driving automation, reliability, and performance using Kubernetes, AWS, and advanced monitoring tools.
About the role
Key Responsibilities
- Lead and mentor a high‑performing SRE team, setting vision and strategy for reliability, scalability, and automation across the platform.
- Architect and maintain production infrastructure on AWS, leveraging Kubernetes, Terraform, and CI/CD pipelines to ensure rapid, reliable deployments.
- Design and implement observability solutions (metrics, logs, traces) to detect, diagnose, and resolve incidents faster, and foster a culture of blameless post‑mortems.
- Collaborate with product, engineering, and security teams to define SLAs and SLOs and to plan capacity for millions of concurrent mobile users.
- Champion continuous improvement initiatives, including chaos engineering, automated testing, and cost‑optimization strategies.
Requirements
- 10+ years of experience in large‑scale distributed systems, with 5+ years in a leadership role.
- Deep expertise in Kubernetes, AWS services (EKS, EC2, RDS, S3), and IaC tools like Terraform.
- Proven track record of building CI/CD pipelines, monitoring stacks (Prometheus, Grafana, ELK), and incident response frameworks.
- Strong communication skills, with the ability to explain technical concepts to non‑technical stakeholders.
- Passion for gaming and a user‑centric mindset, with a desire to innovate in a fast‑moving industry.