remote
Site Reliability Engineering Manager, Data Infra - ComplyAdvantage
Engineering Manager
Lead a high‑performing SRE team building resilient, scalable, and secure data infrastructure on Kubernetes and AWS with advanced observability tooling. Drive automation, effective incident response, and continuous improvement in a fast‑paced fintech environment.
About the role
Key Responsibilities
- Lead, mentor, and grow a team of SREs, fostering a culture of ownership and continuous learning.
- Design and implement scalable, highly available data infrastructure on Kubernetes and AWS, ensuring performance and reliability at scale.
- Drive observability strategy: implement monitoring, logging, and tracing solutions to detect, diagnose, and prevent incidents.
- Establish and refine incident response processes, including runbooks and blameless post‑mortems.
- Automate deployment pipelines, configuration management, and operational tasks using CI/CD and IaC best practices.
- Collaborate with Engineering, Product, and Security teams to embed security and compliance into the infrastructure lifecycle.
Requirements
- 5+ years of SRE or DevOps experience, with 2+ years in a leadership role.
- Hands‑on expertise with container orchestration on Kubernetes and with AWS services (EC2, EKS, RDS, S3).
- Strong background in observability tools (Prometheus, Grafana, ELK, or similar) and incident management frameworks.
- Proficiency in scripting (Python, Bash) and automation tools (Terraform, Ansible, GitHub Actions).
- Excellent communication skills and a proven ability to influence cross‑functional teams.