Remote
Senior SRE, Ads - Reddit
Site Reliability Engineer
Senior SRE driving reliability and scalability of Reddit’s advertising platform using Kubernetes, AWS, Docker, and advanced monitoring to ensure high availability and rapid incident resolution.
About the role
Key Responsibilities
- Design, build, and maintain highly available, scalable infrastructure for the Ads platform on Kubernetes and AWS.
- Implement and evolve CI/CD pipelines, ensuring rapid, reliable deployments with zero downtime.
- Develop and maintain the observability stack (metrics, logs, traces) to detect, diagnose, and resolve incidents quickly.
- Lead incident response, root cause analysis, and post‑mortem documentation to drive continuous improvement.
- Collaborate with product, engineering, and security teams to define reliability SLAs and enforce best practices.
Requirements
- 5+ years of SRE or DevOps experience in a large-scale, high‑traffic environment.
- Deep expertise with Kubernetes, Docker, and AWS services (EKS, EC2, RDS, S3).
- Proficient in scripting (Python, Bash) and infrastructure-as-code (Terraform, CloudFormation).
- Strong background in monitoring, alerting, and incident management (Prometheus, Grafana, PagerDuty).
- Excellent communication skills and a collaborative mindset.
Skills
Kubernetes, AWS, Docker, CI/CD