Remote
Senior Site Reliability Engineer - Filevine
Site Reliability Engineer
The Senior Site Reliability Engineer is responsible for designing, automating, and scaling highly available cloud infrastructure, implementing monitoring, and driving reliability best practices using Python, Kubernetes, AWS, Terraform, and CI/CD pipelines.
About the role
Key Responsibilities
- Design, build, and maintain scalable, fault‑tolerant infrastructure on AWS supporting a high‑throughput legal AI platform.
- Develop automation scripts and services in Python to streamline provisioning, configuration, and deployment workflows.
- Implement and manage container orchestration with Kubernetes, ensuring optimal performance, security, and resource utilization.
- Create and maintain infrastructure as code using Terraform, enabling repeatable and auditable environment creation.
- Establish robust monitoring, alerting, and observability stacks with Prometheus and Grafana to proactively detect and resolve incidents.
- Collaborate with development and product teams to define SLOs/SLIs, conduct post‑mortems, and continuously improve reliability processes.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, preferably in SaaS or AI‑driven environments.
- Strong proficiency in Python for automation and tooling.
- Deep hands‑on experience with Kubernetes, AWS services, and Terraform.
- Expertise in monitoring, logging, and alerting solutions such as Prometheus, Grafana, and the ELK stack.
- Solid understanding of CI/CD pipelines, containerization, and Linux system administration.
Skills
Python, Kubernetes, AWS, Terraform, Prometheus, CI/CD