Remote
Site Reliability Engineer - Metova Federal
The Site Reliability Engineer is responsible for designing, deploying, and maintaining highly available, scalable infrastructure on AWS, using Kubernetes, Docker, Terraform, and monitoring tools to support mission‑critical federal applications.
About the role
Key Responsibilities
- Design, implement, and manage scalable, highly available Kubernetes clusters on AWS for mission‑critical workloads.
- Automate infrastructure provisioning and configuration using Terraform, ensuring repeatable and auditable deployments.
- Implement CI/CD pipelines with GitHub Actions or Jenkins to streamline application releases and rollbacks.
- Monitor system health with Prometheus, Grafana, and CloudWatch, proactively identifying and resolving performance bottlenecks.
- Collaborate with development teams to enforce best practices for observability, security, and cost optimization.
Requirements
- 3+ years of experience in site reliability or DevOps roles, preferably in federal or defense environments.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, CI/CD tooling, and monitoring/alerting stacks.
- Strong scripting skills in Bash or Python for automation.
- Excellent problem‑solving skills and the ability to work in a fast‑paced, mission‑critical setting.
Skills
Kubernetes, Docker, AWS, Terraform