remote
Site Reliability Engineer, Senior - Booz Allen Hamilton
Site Reliability Engineer
Senior Site Reliability Engineer focused on building resilient, automated infrastructure for the Intelligence Community using Kubernetes, Prometheus, Grafana, AWS, and Terraform to reduce toil and improve system reliability.
About the role
Key Responsibilities
- Design, deploy, and maintain highly available Kubernetes clusters and associated services.
- Implement comprehensive monitoring with Prometheus and Grafana, creating alerts and dashboards for critical metrics.
- Automate infrastructure provisioning and configuration using Terraform and AWS CloudFormation.
- Develop and maintain Bash and Python scripts to reduce manual toil and enable self‑repair mechanisms.
- Collaborate with development teams to embed SRE best practices into CI/CD pipelines and application deployments.
Requirements
- 5+ years of experience in site reliability, DevOps, or systems engineering roles.
- Proficiency with Kubernetes, Docker, and container orchestration at scale.
- Strong scripting skills in Bash and Python, with a track record of automating routine tasks.
- Hands‑on experience with Prometheus, Grafana, and alerting systems.
- Experience deploying and managing workloads on AWS, including EC2, EKS, and related services.
Skills
Python, Bash, Kubernetes, Prometheus, Grafana, AWS, Terraform