remote

Senior Manager, Site Reliability Engineering - Clover Health

Software Engineer

Leads a global SRE team, overseeing day-to-day reliability and long-term platform strategy. Drives automation, cloud infrastructure, and observability using Kubernetes, GCP, Terraform, Python, and CI/CD pipelines.

About the role

At Counterpart Health, we are transforming healthcare and improving patient care with our innovative primary care tool, Counterpart Assistant. By supporting Primary Care Physicians (PCPs), we deliver improved outcomes at lower cost through early diagnosis and longitudinal care management of chronic conditions.

We're looking for a Senior Manager of Site Reliability Engineering to join our team. You'll lead a team of ~10 SREs across North America, UK, HK, and New Zealand — owning both the day-to-day operations and the long-term technical direction of the SRE organization. This role sits at the intersection of people leadership, technical depth, and strategic partnership: you're here to make Counterpart’s infrastructure reliable, scalable, and cost-efficient — and to transform the SRE team's engagement model from reactive support to proactive collaboration with our product engineering pillars.

As a Senior Manager, Site Reliability Engineering, you will:

Lead and grow our SRE team of ~10 engineers, including hiring, retention, career development, and performance management across multiple time zones (US, HK, NZ).
Build strategic partnerships with product engineering pillars — shifting SRE from reactive, ticket-based support to proactive co-ownership of reliability outcomes.
Scale our multi-tenant infrastructure to support new customer onboarding and growing patient populations.
Own cloud cost management and FinOps practices, building frameworks that balance cost control with reliability and performance.
Champion developer self-service and platform engineering. Build self-service capabilities so product teams can manage routine operations without filing SRE tickets. Establish SLOs/SLIs for critical services and improve alert quality so every page is meaningful.
Ensure the SRE team is fully leveraging AI tooling in their workflows — using tools like Claude Code for IaC generation, log analysis, root cause investigation, and automating repetitive work — at the same level as the rest of engineering.

You should get in touch if:

You have 6+ years managing an SRE team and 10+ years of hands-on SRE or infrastructure engineering experience.
You're deeply comfortable with our core stack: Kubernetes, GCP (GKE, Cloud SQL, Pub/Sub, GCS), Terraform, Helm, ArgoCD, PostgreSQL, and Prometheus/Grafana.
You have strong programming skills in Python and/or Go, and you're comfortable writing and reviewing infrastructure tooling code — including using AI coding tools to do so.
You have experience with CI/CD pipelines (GitHub Actions) and a track record of building or improving developer tooling and automation.
You have sound build vs. buy judgment — you default to the right answer, not the easiest one, and you're comfortable building internal tooling when existing solutions don't fit.
You have experience leading teams across mult
