remote
Senior Engineering Manager - Site Reliability - Trackunit
Engineering Manager
Lead a global SRE team to build scalable, observable, and secure platform infrastructure for a large‑scale IoT product, driving cross‑regional strategy and tooling excellence.
About the role
Key Responsibilities
- Lead and mentor a distributed SRE team across North America and Europe, setting technical direction and fostering a culture of reliability.
- Design, implement, and maintain scalable infrastructure using Kubernetes, Docker, and Terraform on AWS.
- Build and enhance the observability stack with Prometheus, Grafana, and custom alerting, targeting 99.99% uptime for a platform used by 200+ developers.
- Drive continuous integration and delivery pipelines, automating deployments and rollbacks for rapid feature delivery.
- Collaborate with product, security, and DevOps teams to define and enforce best practices for security, compliance, and incident response.
Requirements
- 10+ years of engineering experience, including 5+ years in a senior SRE or DevOps leadership role.
- Deep expertise in Kubernetes, container orchestration, and cloud infrastructure (AWS preferred).
- Proven track record building observability solutions with Prometheus, Grafana, and related tooling.
- Strong background in CI/CD, automation, and infrastructure as code (Terraform, Ansible).
- Excellent communication skills and experience leading cross‑regional teams.
Skills
Kubernetes, Prometheus, Grafana, CI/CD, AWS, Docker, Terraform