onsite
Site Reliability Engineering Manager - Apple
Engineering Manager
Lead a high‑performing SRE team to design, automate, and operate scalable, reliable services using Kubernetes, Terraform, and cloud platforms while driving incident response, performance monitoring, and continuous delivery.
About the role
Key Responsibilities
- Lead, mentor, and grow a team of site reliability engineers, fostering a culture of reliability, automation, and continuous improvement.
- Design and implement highly available, scalable architectures on AWS using Kubernetes, Terraform, and related cloud‑native technologies.
- Develop and maintain observability stacks (Prometheus, Grafana, logging) to proactively detect and resolve performance issues.
- Own incident management end‑to‑end: on‑call rotation, root‑cause analysis, post‑mortems, and process enhancements.
- Drive CI/CD pipelines and infrastructure‑as‑code practices to accelerate safe deployments.
- Collaborate with product, engineering, and security teams to embed reliability standards throughout the software lifecycle.
Requirements
- 5+ years of hands‑on SRE or DevOps experience, with at least 2 years in a people‑management role.
- Deep expertise in Kubernetes, Terraform, and AWS services.
- Proficiency in programming/scripting languages such as Go and Python for automation.
- Strong background in monitoring, alerting, and incident response using tools like Prometheus and Grafana.
- Experience building and maintaining CI/CD pipelines and infrastructure‑as‑code workflows.
Skills
Kubernetes, Terraform, Prometheus, Go, Python, AWS, CI/CD