Onsite
Site Reliability Engineer - Guidehouse
Lead the design, automation, and incident response for cloud‑native services, driving reliability and scalability using Kubernetes, CI/CD pipelines, and AWS infrastructure within an Agile Scrum environment.
About the role
Key Responsibilities
- Collaborate with cross‑functional teams to establish and evolve SRE practices within an Agile Scrum framework.
- Participate in system design reviews, identifying failure points and championing automation and self‑healing solutions.
- Conduct code reviews focused on efficiency, testability, and scalability of infrastructure and application components.
- Lead incident management and post‑incident reviews, performing root‑cause analysis and implementing preventive measures to reduce downtime.
- Develop and maintain comprehensive documentation for systems, processes, and runbooks.
Requirements
- Proven experience as a Site Reliability Engineer or in a similar role, with strong knowledge of Kubernetes and container orchestration.
- Hands‑on expertise in CI/CD pipelines, GitOps, and automated deployment strategies on AWS.
- Solid understanding of monitoring, alerting, and observability tools (Prometheus, Grafana, the ELK Stack).
- Experience with scripting and infrastructure automation (Python, Bash, Terraform).
- Strong communication skills and ability to work effectively in an Agile Scrum environment.