remote

Senior Site Reliability Engineer, Agentic Search - Tavily

Site Reliability Engineer

Lead the design and operation of scalable, highly available infrastructure for agentic web interaction, using Kubernetes, Docker, Terraform, and AWS to support real‑time retrieval-augmented generation (RAG) and other AI workloads.

About the role

Key Responsibilities

Architect, deploy, and maintain a resilient Kubernetes cluster that supports high‑throughput, low‑latency AI inference and web‑scraping workloads.
Implement infrastructure as code (IaC) with Terraform to provision and manage AWS resources, ensuring repeatable, auditable infrastructure.
Design and maintain observability pipelines with Prometheus, Grafana, and Loki, providing real‑time metrics, alerts, and dashboards for production systems.
Collaborate with backend and ML teams to optimize container images, CI/CD pipelines, and deployment strategies for rapid feature rollout.
Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve system reliability.

Requirements

5+ years of SRE or DevOps experience in a fast‑moving AI or web‑scale environment.
Proficiency with Kubernetes, Docker, and Terraform on AWS.
Strong scripting skills in Python (or Go) for automation and tooling.
Hands‑on experience with Prometheus, Grafana, and alerting best practices.
Excellent problem‑solving skills, with a track record of improving system reliability and performance.

Skills

Kubernetes, Docker, Terraform, AWS, Python, Prometheus, Grafana

Company: Tavily

Department: Engineering

Location: New York, NY, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 19, 2026