Remote
Lead Site Reliability Engineer, Data - Comcast
Site Reliability Engineer
Lead the reliability and scalability of large‑scale data platforms, driving automation, monitoring, and incident response using Python, Kubernetes, Terraform, AWS, and modern CI/CD practices.
About the role
Key Responsibilities
- Design, implement, and maintain highly available data pipelines and storage services across cloud and on‑premise environments.
- Develop automation scripts, infrastructure‑as‑code (Terraform), and CI/CD pipelines to streamline provisioning, deployment, and scaling of data workloads.
- Build and operate monitoring, alerting, and observability stacks (Prometheus, Grafana, centralized logging) to meet performance SLAs and resolve incidents quickly.
- Collaborate with data engineers, product owners, and security teams to define reliability standards, capacity plans, and disaster‑recovery strategies.
- Lead incident response, root‑cause analysis, and post‑mortem processes, driving continuous improvement and knowledge sharing across the organization.
Requirements
- 5+ years of SRE or DevOps experience managing large‑scale data platforms (databases, data lakes, streaming services).
- Strong proficiency in Python for automation and tooling.
- Deep hands‑on experience with Kubernetes, Terraform, and AWS services (EC2, S3, RDS, EMR).
- Expertise in monitoring and observability tools such as Prometheus, Grafana, and centralized logging solutions.
- Solid understanding of SQL databases, data modeling, and performance tuning.
Skills
Python, Kubernetes, Terraform, AWS, SQL, Prometheus, CI/CD