remote
Senior Site Reliability Engineer - Synthesia
Site Reliability Engineer
Lead the design, deployment, and operation of scalable, highly available services on Kubernetes and AWS, ensuring reliability, performance, and rapid incident response for a global AI video platform.
About the role
Key Responsibilities
- Architect and maintain highly available, scalable infrastructure on Kubernetes and AWS, ensuring 99.99% uptime for mission‑critical services.
- Design and implement CI/CD pipelines, automated testing, and blue‑green deployments to accelerate feature delivery while minimizing risk.
- Monitor system health with Prometheus, Grafana, and custom alerts; conduct post‑mortem analysis and drive continuous improvement.
- Collaborate with development, security, and product teams to embed observability, resilience, and cost‑efficiency into every release.
- Lead incident response, root‑cause analysis, and knowledge‑sharing sessions to elevate team expertise.
Requirements
- 5+ years of SRE or DevOps experience in a high‑scale, cloud‑native environment.
- Proficiency with Kubernetes, Docker, Helm, and Terraform for infrastructure as code.
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (GitHub Actions, ArgoCD, Jenkins).
- Hands‑on experience with AWS services (EKS, EC2, S3, CloudWatch) and monitoring tools (Prometheus, Grafana).
- Excellent problem‑solving, communication, and collaboration skills in a fast‑moving, cross‑functional team.
Skills
Kubernetes, Docker, CI/CD, AWS, Prometheus, Grafana, Terraform