onsite
Site Reliability Engineering Leader - kontakt.io
Software Engineer
Lead a high‑performing Site Reliability Engineering team to ensure the reliability, performance, and scalability of a healthcare IoT platform, using Kubernetes, AWS, Terraform, and observability tools such as Prometheus and Grafana.
About the role
Key Responsibilities
- Define and execute the SRE strategy, driving reliability, availability, and performance targets for a real‑time healthcare platform.
- Build, scale, and maintain Kubernetes‑based infrastructure on AWS, using infrastructure‑as‑code (IaC) tools such as Terraform.
- Implement and evolve monitoring, alerting, and observability pipelines with Prometheus, Grafana, and related tooling.
- Lead incident response, post‑mortems, and continuous improvement processes to reduce MTTR and prevent recurrence.
- Mentor and grow a team of site reliability engineers, fostering a culture of automation, ownership, and proactive reliability.
Requirements
- 5+ years of hands‑on SRE or DevOps experience, with at least 2 years in a leadership role.
- Deep expertise in Kubernetes orchestration and AWS cloud services.
- Proficiency with infrastructure‑as‑code (Terraform or CloudFormation) and CI/CD pipelines.
- Strong background in observability stacks (Prometheus, Grafana, Loki) and incident management.
- Excellent communication skills and ability to collaborate with engineering, product, and operations teams.
Skills
Kubernetes, AWS, Terraform, Prometheus, CI/CD