onsite
Staff Software Engineer, Site Reliability Engineering - Google
Software Engineer
About the role
Lead the design, implementation, and operation of large‑scale, fault‑tolerant services, applying software engineering best practices to ensure high availability and performance across distributed cloud environments.
Key Responsibilities
- Architect, develop, and maintain highly available services and tooling using languages such as Go and Python.
- Design and implement automation for deployment, scaling, and incident response on Kubernetes and cloud platforms.
- Analyze system performance, conduct root‑cause analysis, and drive reliability improvements for distributed architectures.
- Build monitoring, alerting, and observability pipelines to proactively detect and resolve issues.
- Mentor engineers, establish best practices, and contribute to SRE culture across teams.
Requirements
- 8+ years of software development experience with strong proficiency in at least one systems language (e.g., Go, C++, Java).
- 3+ years of experience designing, troubleshooting, and operating large‑scale distributed systems.
- Hands‑on experience with container orchestration (Kubernetes) and cloud infrastructure (GCP, AWS, or Azure).
- Deep understanding of monitoring, logging, and incident management frameworks.
- Bachelor’s degree in Computer Science or a related field; an advanced degree is a plus.