remote

Senior Site Reliability Engineer (SRE) - Kody

Site Reliability Engineer

Senior SRE who drives platform scalability and reliability, implements guardrails, automates operations, and leads incident response using Kubernetes, Docker, AWS, Terraform, Prometheus, and CI/CD pipelines.

About the role

Key Responsibilities

Design, implement, and maintain reliability guardrails in partnership with Platform Engineering to ensure high‑availability services.
Develop and operate monitoring, alerting, and observability solutions (e.g., Prometheus, Grafana) to detect and diagnose production issues.
Automate infrastructure provisioning and configuration management using Terraform and CI/CD pipelines.
Manage container orchestration platforms (Kubernetes, Docker) to support scalable deployments.
Lead incident response, perform root‑cause analysis, and drive post‑mortem improvements.
Collaborate with development teams to embed reliability best practices into the software development lifecycle.

Requirements

5+ years of experience in site reliability, DevOps, or systems engineering.
Strong hands‑on expertise with Kubernetes, Docker, and cloud services (AWS preferred).
Proficiency in infrastructure‑as‑code tools such as Terraform and automation pipelines (Jenkins, GitHub Actions, etc.).
Experience building monitoring and alerting stacks using Prometheus, Grafana, or similar tools.
Demonstrated ability to handle high‑pressure incidents, perform thorough post‑mortems, and drive continuous improvement.

Skills

Kubernetes, Docker, AWS, Terraform, Prometheus, CI/CD

Company: Kody

Department: Engineering

Location: San Jose, California, United States

Experience: 5+ years

Tenure: full-time

Level: Senior

Posted June 23, 2026