Remote
Senior Site Reliability Engineer, Data - Comcast
Site Reliability Engineer
As a Senior Site Reliability Engineer, you will own the reliability, scalability, and performance of the data platform, using Python, AWS, and Kubernetes to support global ad‑tech operations.
About the role
Key Responsibilities
- Design, implement, and maintain highly available data pipelines and services across AWS and on‑prem environments.
- Collaborate with data engineers to optimize data ingestion, processing, and storage for millions of events per day.
- Develop and enforce SLOs, SLAs, and automated monitoring/alerting using Prometheus, Grafana, and CloudWatch.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
- Automate deployment and configuration management with Terraform, Helm, and CI/CD pipelines.
Requirements
- 5+ years of SRE or DevOps experience in a data‑centric environment.
- Strong proficiency in Python and Bash scripting for automation.
- Hands‑on experience with AWS services (EC2, RDS, S3, Lambda) and Kubernetes clusters.
- Deep understanding of data engineering concepts, ETL workflows, and distributed storage systems.
- Excellent problem‑solving and communication skills, and a proactive approach to reliability.
Skills
Python, AWS, Kubernetes