Remote
Sr. SRE II - Filevine
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability, automation, and scalability for a high‑growth Legal AI platform using Kubernetes, Docker, AWS, and modern observability tools.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure for a cloud‑native Legal AI platform.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
- Build and maintain CI/CD pipelines, infrastructure as code (Terraform), and container orchestration (Kubernetes).
- Implement monitoring, alerting, and observability with Prometheus, Grafana, and custom dashboards.
- Collaborate with development, security, and product teams to enforce best practices and drive automation.
Requirements
- 5+ years of SRE or DevOps experience in a fast‑paced, cloud‑first environment.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, CI/CD tooling (GitHub Actions, Jenkins, Argo CD), and scripting (Python, Bash).
- Strong knowledge of monitoring, alerting, and incident management tools (Prometheus, Grafana, PagerDuty).
- Excellent communication skills and a collaborative mindset.
Skills
Kubernetes, Docker, CI/CD, AWS, Terraform, Prometheus, Grafana