onsite
Senior Site Reliability Engineer - NewDay
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability, performance, and automation across a cloud-native platform using Python, Kubernetes, Docker, and AWS. Lead initiatives to eliminate toil, implement observability, and shape modern SRE practices.
About the role
Key Responsibilities
- Design, build, and maintain highly available, scalable services on Kubernetes and AWS.
- Automate deployment pipelines, configuration management, and incident response using Terraform, CI/CD, and scripting.
- Implement an observability stack (Prometheus, Grafana, Loki) to monitor performance, detect anomalies, and drive proactive improvements.
- Collaborate with development teams to embed reliability best practices into code reviews and release processes.
- Lead post‑mortem analysis, root‑cause investigations, and continuous improvement initiatives.
Requirements
- 5+ years of experience in site reliability or DevOps roles.
- Strong proficiency in Python and/or Go for automation and tooling.
- Hands‑on experience with Kubernetes, Docker, and cloud infrastructure (AWS).
- Expertise in IaC (Terraform) and CI/CD pipelines.
- Deep understanding of monitoring, alerting, and incident management.
Skills
Python, Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana