Remote
Site Reliability Engineer - Booz Allen Hamilton
This Site Reliability Engineer role focuses on building resilient, automated cloud infrastructure for the Intelligence Community, using monitoring, redundancy, and scripting to reduce toil and improve system reliability.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable cloud infrastructure to support mission-critical applications.
- Develop and deploy automation scripts to reduce manual toil and enable self‑repair capabilities.
- Implement comprehensive monitoring, alerting, and logging solutions to detect and resolve incidents proactively.
- Collaborate with development and operations teams to embed reliability best practices into the software delivery lifecycle.
- Conduct post‑incident reviews, root‑cause analysis, and continuous improvement initiatives.
Requirements
- Proven experience in Site Reliability Engineering or related roles with a strong focus on cloud platforms.
- Hands‑on expertise in automation tools (e.g., Terraform, Ansible, Python scripting).
- Deep knowledge of monitoring and observability stacks (e.g., Prometheus, Grafana, ELK).
- Strong understanding of networking, security, and high‑availability design principles.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
AWS, Kubernetes, Docker, Linux, Jenkins, Jira, Confluence