Remote
Senior Lead Site Reliability Engineer - JPMorganChase
Site Reliability Engineer
Lead a high‑performing SRE team to define, implement, and monitor reliability standards for critical financial services, leveraging Kubernetes, Docker, Terraform, Python, and AWS to meet stringent availability and performance targets.
About the role
Key Responsibilities
- Define and enforce non‑functional requirements (NFRs) and availability targets for core banking and compliance services.
- Design, implement, and maintain scalable, highly available infrastructure using Kubernetes, Docker, and Terraform on AWS.
- Develop and own monitoring, alerting, and observability solutions that measure service level indicators (SLIs) against service level objectives (SLOs).
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system reliability.
- Mentor and guide SRE engineers, fostering best practices in automation, CI/CD pipelines, and code quality.
Requirements
- 5+ years of experience in site reliability or production engineering, preferably in financial services.
- Strong expertise with Kubernetes, Docker, Terraform, and AWS cloud services.
- Proficiency in Python for automation, scripting, and tooling development.
- Hands‑on experience building monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, ELK).
- Demonstrated ability to lead incident management, conduct thorough post‑mortems, and drive reliability improvements.
Skills
Kubernetes, Docker, Terraform, Python, AWS, CI/CD