onsite
Senior Site Reliability Engineer, Storage SRE - Apple
Site Reliability Engineer
Senior SRE leading reliability for large‑scale storage and analytics platforms, using Kubernetes, Go, Python, Prometheus, and AWS to design, automate, and optimize high‑performance infrastructure.
About the role
Key Responsibilities
- Lead the design, deployment, and operation of highly available storage and analytics services supporting petabyte‑scale workloads.
- Apply SRE principles to define SLIs/SLOs, implement robust monitoring with Prometheus, and drive incident response and post‑mortem processes.
- Develop automation and tooling in Go and Python to improve provisioning, configuration management, and self‑service capabilities.
- Collaborate with cross‑functional engineering teams on capacity planning and on optimizing performance and cost efficiency on AWS.
- Mentor junior SREs and partner engineers, fostering a culture of reliability, continuous improvement, and knowledge sharing.
Requirements
- 5+ years of experience in site reliability or production engineering for large‑scale distributed systems.
- Strong proficiency in Go and Python for building automation and observability tools.
- Deep hands‑on experience with Kubernetes for container orchestration and with AWS as a cloud platform.
- Expertise in monitoring, alerting, and metrics collection using Prometheus or similar systems.
- Proven track record of implementing SRE best practices, including incident management, capacity planning, and reliability engineering.
Skills
Kubernetes, Go, Python, Prometheus, AWS