Remote
SRE Observability Technical Lead - Vice President - Citi
Engineering Manager
Lead the design and operation of observability solutions for a global financial platform, driving reliability, performance, and automation across Kubernetes, AWS, and CI/CD pipelines.
About the role
Key Responsibilities
- Architect and maintain an end‑to‑end observability stack (Prometheus, Grafana, Loki) for high‑availability services.
- Lead incident response and post‑mortem processes, ensuring continuous improvement of SLOs and SLAs.
- Collaborate with development teams to embed observability best practices into CI/CD pipelines and infrastructure as code.
- Drive automation of alerting, monitoring, and capacity planning using Terraform and AWS CloudWatch.
- Mentor and coach SRE teams, fostering a culture of reliability and proactive problem‑solving.
Requirements
- 5+ years of SRE or DevOps experience in a large, distributed environment.
- Deep expertise with Kubernetes, Prometheus, Grafana, and AWS services.
- Proficiency in Terraform, Python, and CI/CD tooling (GitHub Actions, Jenkins).
- Strong analytical skills and a track record of improving system reliability.
- Excellent communication and leadership abilities.
Skills
Kubernetes, Prometheus, Grafana, AWS, Terraform, CI/CD