onsite

Site Reliability Engineering Leader - kontakt.io

Software Engineer

Lead a high‑performing Site Reliability Engineering team to ensure platform reliability, performance, and scalability for a healthcare IoT platform, leveraging Kubernetes, AWS, Terraform, and modern observability tools.

About the role

Key Responsibilities

Define and execute the SRE strategy, driving reliability, availability, and performance targets for a real‑time healthcare platform.
Build, scale, and maintain Kubernetes‑based infrastructure on AWS, using IaC tools such as Terraform.
Implement and evolve monitoring, alerting, and observability pipelines with Prometheus, Grafana, and related tooling.
Lead incident response, post‑mortems, and continuous improvement processes to reduce MTTR and prevent recurrence.
Mentor and grow a team of site reliability engineers, fostering a culture of automation, ownership, and proactive reliability.

Requirements

5+ years of hands‑on SRE or DevOps experience, with at least 2 years in a leadership role.
Deep expertise in Kubernetes orchestration and AWS cloud services.
Proficiency with infrastructure‑as‑code (Terraform or CloudFormation) and CI/CD pipelines.
Strong background in observability stacks (Prometheus, Grafana, Loki) and incident management.
Excellent communication skills and the ability to collaborate with engineering, product, and operations teams.

Skills

Kubernetes, AWS, Terraform, Prometheus, CI/CD

Company: kontakt.io

Department: Engineering

Location: United States

Experience: 3+ years

Tenure: full-time

Level: Mid-Level

Posted June 25, 2026