Remote
Staff SRE, Ads - Reddit
Site Reliability Engineer
Lead the reliability and scalability of Reddit’s advertising platform, driving automation, incident response, and performance optimization across Kubernetes and AWS environments.
About the role
Key Responsibilities
- Own end‑to‑end reliability of the Ads platform, ensuring high availability and rapid incident resolution.
- Design, implement, and maintain CI/CD pipelines that automate deployments across Kubernetes clusters in AWS.
- Develop and refine monitoring, alerting, and observability solutions to detect and mitigate performance bottlenecks.
- Collaborate with engineering, product, and security teams to embed reliability best practices into feature development.
- Lead post‑mortem analyses, driving actionable improvements and knowledge sharing across the organization.
Requirements
- 5+ years of SRE or DevOps experience operating large-scale distributed systems.
- Deep expertise in container orchestration with Kubernetes and in AWS services (EKS, EC2, S3, CloudWatch).
- Proficiency in scripting (Python, Bash) and CI/CD tooling (GitHub Actions, ArgoCD, Jenkins).
- Strong background in monitoring, alerting, and incident response (Prometheus, Grafana, PagerDuty).
- Excellent communication skills and a collaborative mindset.