onsite
Site Reliability Engineer - Apple
Drive reliability and scalability for Apple’s global services by building resilient infrastructure, automating deployments, and responding to incidents, working with Python, Kubernetes, and advanced monitoring tools.
About the role
Key Responsibilities
- Design, build, and maintain highly available infrastructure that supports Apple’s services such as iCloud, iTunes, Siri, and Maps.
- Automate deployment pipelines and configuration management to accelerate feature delivery while ensuring stability.
- Implement and refine monitoring, alerting, and incident response processes to minimize downtime and improve service reliability.
- Collaborate with software engineering teams to embed reliability best practices into product development.
- Analyze performance metrics, conduct post‑mortems, and drive continuous improvement initiatives.
Requirements
- Proven experience as a Site Reliability Engineer, or in a similar role, in a large-scale, high‑traffic environment.
- Strong scripting skills in Python and familiarity with container orchestration (Kubernetes) and cloud platforms.
- Deep understanding of monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, ELK).
- Excellent problem‑solving abilities and a proactive approach to incident management.
- Effective communication skills and a collaborative mindset.