Onsite
Senior Site Reliability Engineer (SRE) - AccelByte
Site Reliability Engineer
Senior SRE responsible for designing, automating, and operating highly available cloud infrastructure, leveraging Kubernetes, Docker, Terraform, and monitoring tools to ensure reliability and performance of critical services.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, fault‑tolerant infrastructure on AWS using IaC tools such as Terraform.
- Develop and manage container platforms (Docker, orchestrated with Kubernetes) to support microservice deployments.
- Build and maintain CI/CD pipelines, automating build, test, and release processes.
- Implement comprehensive monitoring, alerting, and observability solutions with Prometheus, Grafana, and log aggregation tools.
- Collaborate with development teams to improve application reliability, performance, and incident response.
- Participate in the on‑call rotation, conduct root‑cause analysis, and drive post‑mortem improvements.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, with a strong focus on cloud platforms (AWS).
- Proficiency in container technologies (Kubernetes, Docker) and infrastructure as code (Terraform, CloudFormation).
- Solid scripting/programming skills in Python or similar languages.
- Hands‑on experience with monitoring and observability stacks (Prometheus, Grafana, ELK/EFK).
- Deep understanding of Linux systems, networking, and CI/CD concepts.
Skills
Kubernetes, Docker, Terraform, Prometheus, Grafana, Python, AWS, CI/CD