Remote

Senior Site Reliability Engineer - Ellucian

Site Reliability Engineer

Senior Site Reliability Engineer driving reliability, scalability, and automation for a cloud‑native SaaS platform using Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, and CI/CD pipelines.

About the role

Key Responsibilities

Design, implement, and maintain highly available, scalable infrastructure on AWS using Terraform and Kubernetes.
Build and manage the observability stack (Prometheus, Grafana, and Loki) to ensure proactive monitoring and alerting.
Automate deployment pipelines with CI/CD tools (GitHub Actions, ArgoCD) and enforce GitOps practices.
Collaborate with development teams to optimize application performance, reduce latency, and improve deployment frequency.
Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.

Requirements

5+ years of SRE or DevOps experience in a cloud‑native environment.
Proficiency with Kubernetes, Docker, and container orchestration best practices.
Hands‑on experience with AWS services (EKS, EC2, RDS, S3) and IaC using Terraform.
Strong scripting skills in Bash, Python, or Go for automation.
Excellent communication, problem‑solving, and collaboration skills.

Skills

Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, CI/CD

Company: Ellucian

Department: Engineering

Location: VA, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 19, 2026