Onsite
Engineer 3, Site Reliability Engineer - Comcast
Site Reliability Engineer
Experienced Site Reliability Engineer responsible for designing, automating, and operating scalable cloud infrastructure, ensuring high availability and performance of critical media and technology services.
About the role
Key Responsibilities
- Design, build, and maintain highly available, fault‑tolerant services on AWS using infrastructure‑as‑code tools such as Terraform.
- Develop and support container orchestration platforms (Kubernetes) and automate deployment pipelines with CI/CD frameworks.
- Implement monitoring, alerting, and observability solutions (Prometheus, Grafana) to proactively detect and resolve incidents.
- Collaborate with development and product teams to improve reliability, performance, and scalability of applications.
- Lead incident response, perform root‑cause analysis, and drive post‑mortem improvements.
Requirements
- 5+ years of experience in site reliability or production engineering roles.
- Strong proficiency with Linux systems, scripting (Python or Bash), and cloud platforms (AWS).
- Hands‑on experience with Kubernetes, Terraform, and CI/CD tools (Jenkins, GitLab CI, or similar).
- Deep understanding of monitoring, logging, and alerting frameworks (Prometheus, Grafana, ELK).
- Proven track record of incident management, troubleshooting complex distributed systems, and driving automation.
Skills
Linux, Kubernetes, Terraform, Python, AWS, CI/CD, Prometheus