remote

Site Reliability Engineer, Application Support - NBCUniversal

Site Reliability Engineer focused on application support, responsible for the high availability and performance of cloud‑native services using Kubernetes, AWS, and Python automation. Strong monitoring, incident response, and continuous improvement skills are required.

About the role

Key Responsibilities

Maintain and troubleshoot production applications running on Kubernetes clusters in AWS, ensuring 99.9% uptime.
Implement and manage monitoring, alerting, and log aggregation solutions (Prometheus, Grafana, ELK) to detect and resolve incidents proactively.
Automate deployment pipelines and configuration management using Python scripts and IaC tools (Terraform, CloudFormation).
Collaborate with development teams to design resilient architectures, perform capacity planning, and conduct post‑mortem analyses.
Participate in on‑call rotations, providing rapid incident response and root cause analysis.

Requirements

3+ years of SRE or DevOps experience in a cloud environment.
Hands‑on expertise with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Strong scripting skills in Python and experience with CI/CD pipelines.
Proficiency in monitoring tools (Prometheus, Grafana, ELK) and incident management practices.
Excellent problem‑solving, communication, and teamwork abilities.

Skills

Linux, Kubernetes, AWS, Python

Company: NBCUniversal

Department: Support

Location: London, ENG, United Kingdom

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026