Remote
Senior Site Reliability Engineer (SRE) - Kody
Site Reliability Engineer
Senior SRE who drives platform scalability and reliability, implements guardrails, automates operations, and leads incident response using Kubernetes, Docker, AWS, Terraform, Prometheus, and CI/CD pipelines.
About the role
Key Responsibilities
- Design, implement, and maintain reliability guardrails in partnership with Platform Engineering to ensure high‑availability services.
- Develop and operate monitoring, alerting, and observability solutions (e.g., Prometheus, Grafana) to detect and diagnose production issues.
- Automate infrastructure provisioning and configuration management using Terraform and CI/CD pipelines.
- Manage containerized workloads and orchestration platforms (Docker, Kubernetes) to support scalable deployments.
- Lead incident response, perform root‑cause analysis, and drive post‑mortem improvements.
- Collaborate with development teams to embed reliability best practices into the software development lifecycle.
Requirements
- 5+ years of experience in site reliability, DevOps, or systems engineering.
- Strong hands‑on expertise with Kubernetes, Docker, and cloud services (AWS preferred).
- Proficiency with infrastructure‑as‑code tools such as Terraform and with CI/CD automation pipelines (Jenkins, GitHub Actions, etc.).
- Experience building monitoring and alerting stacks using Prometheus, Grafana, or similar tools.
- Demonstrated ability to handle high‑pressure incidents, perform thorough post‑mortems, and drive continuous improvement.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus, CI/CD