Onsite
Staff Site Reliability Engineer, Cloud Reliability Intelligence - Google
Site Reliability Engineer
Senior SRE leader who drives reliability for cloud services by designing end‑to‑end observability, automation, and AI‑powered workflow improvements across full‑stack architectures.
About the role
Key Responsibilities
- Architect, implement, and operate highly reliable, scalable cloud services supporting critical workloads.
- Lead cross‑functional teams to design, analyze, and troubleshoot distributed systems, ensuring end‑to‑end performance and availability.
- Develop and integrate AI/LLM‑based automation to streamline incident response, root‑cause analysis, and operational workflows.
- Define and enforce reliability standards, service‑level objectives, and policy conformance across the organization.
- Mentor engineers, drive technical roadmaps, and oversee project delivery from concept through production.
Requirements
- 8+ years of experience with data structures, algorithms, and large‑scale system design.
- 3+ years of hands‑on leadership in building and operating distributed, full‑stack cloud architectures.
- Proven track record of applying generative AI or LLMs to automate reliability and operational processes.
- Deep expertise in site reliability engineering practices, including monitoring, alerting, incident management, and capacity planning.
- Strong communication and mentorship skills, with the ability to influence technical direction across multiple teams.