Remote
Senior Site Reliability Engineer - Ellucian
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability, scalability, and automation for a cloud‑native SaaS platform using Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, and CI/CD pipelines.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS using Terraform and Kubernetes.
- Build and manage an observability stack with Prometheus, Grafana, and Loki to ensure proactive monitoring and alerting.
- Automate deployment pipelines with CI/CD tools (GitHub Actions, Argo CD) and enforce GitOps practices.
- Collaborate with development teams to optimize application performance, reduce latency, and improve deployment frequency.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
Requirements
- 5+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with Kubernetes, Docker, and container orchestration best practices.
- Hands‑on experience with AWS services (EKS, EC2, RDS, S3) and IaC using Terraform.
- Strong scripting skills in Bash, Python, or Go for automation.
- Excellent communication, problem‑solving, and collaboration skills.
Skills
Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, CI/CD