Remote
Senior Site Reliability Engineer - Kody
Site Reliability Engineer
Senior SRE responsible for reliability, scalability, and observability of a global payment platform, managing incident response, service‑level objectives, and cloud infrastructure across Kubernetes and AWS environments.
About the role
Key Responsibilities
- Lead production on‑call rotation, acting as primary responder for payment‑service incidents.
- Design, implement, and maintain observability stacks (metrics, logs, tracing) using Prometheus, Grafana, and related tools.
- Define, monitor, and improve service‑level objectives (SLOs) and error‑budget policies for mission‑critical payment workloads.
- Automate infrastructure provisioning and configuration management with Terraform and AWS services.
- Collaborate with development and security teams to ensure secure, highly available Kubernetes deployments across multiple regions.
Requirements
- 5+ years of SRE or DevOps experience in large‑scale, globally distributed systems.
- Deep expertise with Kubernetes orchestration and Linux system administration.
- Strong background in cloud platforms, preferably AWS, including networking, IAM, and managed services.
- Proficiency with infrastructure‑as‑code tools such as Terraform and with scripting languages (Python, Bash).
- Hands‑on experience with monitoring and alerting solutions (Prometheus, Grafana, Loki) and incident‑management processes.
Skills
Kubernetes, AWS, Terraform, Prometheus, Grafana, Linux, Python