remote

Lead Site Reliability Engineer, Data - Comcast

Site Reliability Engineer

Lead the reliability and scalability of large‑scale data platforms, driving automation, monitoring, and incident response using Python, Kubernetes, Terraform, AWS, and modern CI/CD practices.

About the role

Key Responsibilities

Design, implement, and maintain highly available data pipelines and storage services across cloud and on‑premise environments.
Develop automation scripts and infrastructure‑as‑code (Terraform, CI/CD) to streamline provisioning, deployment, and scaling of data workloads.
Build and operate monitoring, alerting, and observability stacks (Prometheus, Grafana, logs) to ensure performance SLAs and rapid incident resolution.
Collaborate with data engineers, product owners, and security teams to define reliability standards, capacity planning, and disaster‑recovery strategies.
Lead incident response, root‑cause analysis, and post‑mortem processes, driving continuous improvement and knowledge sharing across the organization.

Requirements

5+ years of SRE or DevOps experience managing large‑scale data platforms (databases, data lakes, streaming services).
Strong proficiency in Python for automation and tooling.
Deep hands‑on experience with Kubernetes, Terraform, and AWS services (EC2, S3, RDS, EMR).
Expertise in monitoring and observability tools such as Prometheus, Grafana, and centralized logging solutions.
Solid understanding of SQL databases, data modeling, and performance tuning.

Skills

Python, Kubernetes, Terraform, AWS, SQL, Prometheus, CI/CD

Company: Comcast

Department: Engineering

Location: Reston, Virginia, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 25, 2026