onsite
Manager, SRE - airasia
Site Reliability Engineer
Lead a high‑performing Site Reliability Engineering team, driving reliability, automation, and cloud‑native operations across multi‑cloud environments (AWS and GCP) using Kubernetes, CI/CD pipelines, and observability tools such as Prometheus, Grafana, and Datadog.
About the role
Key Responsibilities
- Lead and mentor a team of SREs to design, build, and maintain highly available, scalable services across AWS and GCP.
- Architect and implement CI/CD pipelines, infrastructure as code, and automated deployment workflows.
- Define and enforce reliability SLAs, SLOs, and error budgets, driving continuous improvement.
- Oversee incident response, root‑cause analysis, and post‑mortem processes to reduce MTTR.
- Collaborate with development, security, and product teams to embed reliability best practices into the software lifecycle.
Requirements
- 5+ years of SRE/DevOps experience in a fast‑paced, cloud‑native environment.
- Proficiency with Docker and Kubernetes, including running container workloads at scale.
- Strong scripting skills (Python, Bash) and experience with IaC tools (Terraform, CloudFormation).
- Hands‑on experience with monitoring/observability stacks (Prometheus, Grafana, ELK, Datadog).
- Excellent communication, leadership, and problem‑solving abilities.