remote

Senior Site Reliability Engineer - Kody

Site Reliability Engineer

Senior SRE responsible for reliability, scalability, and observability of a global payment platform, managing incident response, service‑level objectives, and cloud infrastructure across Kubernetes and AWS environments.

About the role

Key Responsibilities

Lead the production on‑call rotation, acting as primary responder for payment‑service incidents.
Design, implement, and maintain observability stacks (metrics, logs, tracing) using Prometheus, Grafana, and related tools.
Define, monitor, and improve service‑level objectives (SLOs) and error‑budget policies for mission‑critical payment workloads.
Automate infrastructure provisioning and configuration management with Terraform and AWS services.
Collaborate with development and security teams to ensure secure, highly available Kubernetes deployments across multiple regions.

Requirements

5+ years of SRE or DevOps experience in large‑scale, globally distributed systems.
Deep expertise with Kubernetes orchestration and Linux system administration.
Strong background in cloud platforms, preferably AWS, including networking, IAM, and managed services.
Proficiency with infrastructure‑as‑code tools such as Terraform and with scripting languages (Python, Bash).
Hands‑on experience with monitoring and alerting solutions (Prometheus, Grafana, Loki) and incident‑management processes.

Skills

Kubernetes, AWS, Terraform, Prometheus, Grafana, Linux, Python

Company: Kody

Department: Engineering

Location: San Francisco, California, United States

Experience: 5+ years

Tenure: full-time

Level: Senior

Posted June 25, 2026