onsite
Senior Site Reliability Engineer - Heartbeat AI GmbH
About the role
Lead the design, deployment, and maintenance of highly available, scalable platform services using Kubernetes, Docker, and cloud-native observability tools. Drive automation, reliability, and performance across the entire infrastructure stack.
Key Responsibilities
- Architect, implement, and operate production-grade Kubernetes clusters and containerized services.
- Design and maintain CI/CD pipelines, ensuring rapid, reliable releases.
- Implement monitoring, alerting, and logging with Prometheus, Grafana, and the ELK stack.
- Automate infrastructure provisioning and configuration using Terraform and IaC best practices.
- Collaborate with development teams to optimize application performance and resilience.
- Respond to incidents, conduct post‑mortems, and drive continuous improvement.
Requirements
- 5+ years of experience in site reliability or DevOps roles.
- Deep knowledge of Kubernetes, Docker, and cloud platforms (AWS preferred).
- Proficiency with monitoring, logging, and alerting tools (Prometheus, Grafana, ELK).
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (GitHub Actions, Jenkins).
- Hands‑on experience with Terraform or similar IaC tools.
Skills
Kubernetes, Docker, Prometheus, Grafana, CI/CD, AWS, Terraform