Remote
Site Reliability Engineer, Application Support - NBCUniversal
This Site Reliability Engineer role focuses on application support: ensuring high availability and performance of cloud‑native services using Kubernetes, AWS, and Python automation. Strong monitoring, incident response, and continuous improvement skills are required.
About the role
Key Responsibilities
- Maintain and troubleshoot production applications running on Kubernetes clusters in AWS, ensuring 99.9% uptime.
- Implement and manage monitoring, alerting, and log aggregation solutions (Prometheus, Grafana, ELK) to detect and resolve incidents proactively.
- Automate deployment pipelines and configuration management using Python scripts and IaC tools (Terraform, CloudFormation).
- Collaborate with development teams to design resilient architectures, perform capacity planning, and conduct post‑mortem analyses.
- Participate in on‑call rotations, providing rapid incident response and root cause analysis.
Requirements
- 3+ years of SRE or DevOps experience in a cloud environment.
- Hands‑on expertise with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Strong scripting skills in Python and experience with CI/CD pipelines.
- Proficiency in monitoring tools (Prometheus, Grafana, ELK) and incident management practices.
- Excellent problem‑solving, communication, and teamwork abilities.
Skills
Linux, Kubernetes, AWS, Python