Remote
Senior Site Reliability Engineer, Agentic Search - Tavily
Lead the design and operation of scalable, highly available infrastructure for agentic web interaction, leveraging Kubernetes, Docker, Terraform, and AWS to support real‑time RAG and AI workloads.
About the role
Key Responsibilities
- Architect, deploy, and maintain a resilient Kubernetes cluster that supports high‑throughput, low‑latency AI inference and web‑scraping workloads.
- Implement infrastructure as code (IaC) with Terraform to provision and manage AWS resources, ensuring repeatable, auditable infrastructure.
- Design and maintain observability pipelines with Prometheus, Grafana, and Loki, providing real‑time metrics, alerts, and dashboards for production systems.
- Collaborate with backend and ML teams to optimize container images, CI/CD pipelines, and deployment strategies for rapid feature rollout.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve system reliability.
Requirements
- 5+ years of SRE or DevOps experience in a fast‑moving AI or web‑scale environment.
- Proficiency with Kubernetes, Docker, and Terraform on AWS.
- Strong scripting skills in Python (or Go) for automation and tooling.
- Hands‑on experience with Prometheus, Grafana, and alerting best practices.
- Excellent problem‑solving skills, with a track record of improving system reliability and performance.
Skills
Kubernetes, Docker, Terraform, AWS, Python, Prometheus, Grafana