Site Reliability Engineer Remote Travel - Omni Federal

Remote Site Reliability Engineer role with 30% travel, responsible for designing, deploying, and maintaining highly available cloud infrastructure using Kubernetes, Docker, AWS, and Terraform. The role uses Prometheus and Grafana for monitoring and automates operations with Python and CI/CD pipelines.

About the role

Key Responsibilities

Design, implement, and manage scalable, highly available Kubernetes clusters on AWS, ensuring zero downtime and optimal performance.
Automate infrastructure provisioning and configuration using Terraform, maintaining version-controlled IaC repositories.
Develop and maintain CI/CD pipelines (GitHub Actions, Jenkins) to streamline application deployments and rollbacks.
Implement comprehensive monitoring and alerting with Prometheus, Grafana, and CloudWatch, proactively identifying and resolving incidents.
Collaborate with development teams to enforce best practices in logging, security hardening, and disaster recovery.
Participate in on-call rotations, incident response, and post-mortem analysis to continuously improve reliability.

Requirements

5+ years of experience in site reliability or DevOps roles, with a strong focus on cloud-native technologies.
Proficient in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Hands-on experience with Terraform, Prometheus, Grafana, and CI/CD tooling.
Strong scripting skills in Python or Bash for automation and tooling.
Excellent problem-solving abilities, communication skills, and a proactive, collaborative mindset.

Skills

Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, Python

Company: Omni Federal

Department: Engineering

Location: VA, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 20, 2026