Site Reliability Engineer Remote Travel - Omni Federal
Site Reliability Engineer
Remote Site Reliability Engineer role with 30% travel, responsible for designing, deploying, and maintaining highly available cloud infrastructure using Kubernetes, Docker, AWS, and Terraform. The role uses monitoring tools such as Prometheus and Grafana and automates operations with Python and CI/CD pipelines.
About the role
Key Responsibilities
- Design, implement, and manage scalable, highly available Kubernetes clusters on AWS, ensuring zero downtime and optimal performance.
- Automate infrastructure provisioning and configuration using Terraform, maintaining version-controlled IaC repositories.
- Develop and maintain CI/CD pipelines (GitHub Actions, Jenkins) to streamline application deployments and rollbacks.
- Implement comprehensive monitoring and alerting with Prometheus, Grafana, and CloudWatch, proactively identifying and resolving incidents.
- Collaborate with development teams to enforce best practices in logging, security hardening, and disaster recovery.
- Participate in on-call rotations, incident response, and post‑mortem analysis to continuously improve reliability.
Requirements
- 5+ years of experience in site reliability or DevOps roles, with a strong focus on cloud-native technologies.
- Proficient in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, Prometheus, Grafana, and CI/CD tooling.
- Strong scripting skills in Python or Bash for automation and tooling.
- Excellent problem‑solving abilities, communication skills, and a proactive, collaborative mindset.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, Python