Remote
SRE Production Support - NBCUniversal
Site Reliability Engineer
Senior Site Reliability Engineer focused on production support for large-scale streaming and media services, leveraging Kubernetes, Docker, AWS, and advanced monitoring to ensure high availability and rapid incident resolution.
About the role
Key Responsibilities
- Provide 24/7 production support for mission‑critical media and streaming services, ensuring uptime and performance targets are met.
- Diagnose, triage, and resolve incidents across Kubernetes clusters, Docker containers, and AWS infrastructure, coordinating with development and operations teams.
- Implement and maintain automated monitoring, alerting, and log aggregation solutions to detect and prevent outages.
- Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce recurrence of incidents.
- Develop and maintain runbooks, playbooks, and documentation for incident response and system operations.
Requirements
- 5+ years of experience in Site Reliability Engineering or production support roles.
- Proficiency in Linux system administration, Kubernetes, Docker, and AWS services (EC2, EKS, CloudWatch).
- Strong scripting skills in Bash, Python, or Go for automation and tooling.
- Hands‑on experience with monitoring/alerting platforms such as Prometheus, Grafana, or Datadog.
- Excellent communication skills and the ability to work collaboratively in a fast‑paced, cross‑functional environment.
Skills
Linux, Kubernetes, Docker, AWS