Remote / Onsite
Engineering Manager, Site Reliability Engineering (SRE) - athenahealth
Hands‑on Engineering Manager leading an SRE team to improve reliability, observability, and automation of cloud‑native services, driving incident response and operational readiness across Linux‑based infrastructure.
About the role
Key Responsibilities
- Lead, mentor, and grow a high‑performing SRE team focused on service reliability and operational excellence.
- Design and implement automation frameworks for provisioning, configuration, and scaling of Linux‑based cloud infrastructure.
- Develop and maintain observability solutions, including metrics, logging, and tracing, to provide deep insight into system health.
- Own incident management processes, conduct post‑mortems, and drive continuous improvement to reduce mean time to recovery.
- Collaborate with product and engineering teams to embed reliability and resiliency best practices into the software development lifecycle.
Requirements
- 5+ years of hands‑on experience managing Linux infrastructure in cloud environments (AWS, Azure, or GCP).
- Proven track record of building and scaling observability platforms and automation pipelines.
- Strong incident response background with experience leading post‑mortem analyses and driving root‑cause remediation.
- Excellent people‑leadership skills, with the ability to coach engineers and foster a culture of reliability.
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.