remote

SRE Production Support - NBCUniversal

Site Reliability Engineer

Senior Site Reliability Engineer focused on production support for large-scale streaming and media services, leveraging Kubernetes, Docker, AWS, and advanced monitoring to ensure high availability and rapid incident resolution.

About the role

Key Responsibilities

Provide 24/7 production support for mission‑critical media and streaming services, ensuring uptime and performance targets are met.
Diagnose, triage, and resolve incidents across Kubernetes clusters, Docker containers, and AWS infrastructure, coordinating with development and operations teams.
Implement and maintain automated monitoring, alerting, and log aggregation solutions to detect and prevent outages.
Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce recurrence of incidents.
Develop and maintain runbooks, playbooks, and documentation for incident response and system operations.

Requirements

5+ years of experience in Site Reliability Engineering or production support roles.
Proficient with Linux system administration, Kubernetes, Docker, and AWS services (EC2, EKS, CloudWatch).
Strong scripting skills in Bash, Python, or Go for automation and tooling.
Hands‑on experience with monitoring/alerting platforms such as Prometheus, Grafana, or Datadog.
Excellent communication skills and ability to work collaboratively in a fast‑paced, cross‑functional environment.

Skills

Linux, Kubernetes, Docker, AWS

Company: NBCUniversal

Department: Support

Location: London, ENG, United Kingdom

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026