Remote
Lead Site Reliability Engineer - JPMorganChase
Site Reliability Engineer
Lead Site Reliability Engineer role focused on the resiliency, scalability, and reliability of enterprise-grade services built with Kubernetes, AWS, Docker, and Terraform, supported by monitoring and observability tooling. Own resiliency reviews, mentor engineering teams, and shape SRE best practices across large-scale products.
About the role
Key Responsibilities
- Lead the design and execution of resiliency reviews for medium- to large-sized products, ensuring high availability and fault tolerance.
- Mentor and coach engineering teams on SRE principles, incident management, and automation best practices.
- Architect and maintain scalable, secure infrastructure using Kubernetes, Docker, Terraform, and AWS services.
- Drive continuous improvement of monitoring, alerting, and observability pipelines to reduce MTTR and improve service health.
- Collaborate with cross‑functional teams to translate business requirements into reliable, maintainable technical solutions.
Requirements
- 5+ years of SRE or DevOps experience in a large enterprise environment.
- Deep expertise in Kubernetes, Docker, Terraform, and AWS (EC2, EKS, RDS, S3).
- Proven track record of designing and operating highly available, scalable systems.
- Strong scripting skills (Python, Bash) and experience with CI/CD pipelines.
- Excellent communication, leadership, and problem‑solving abilities.
Skills
Kubernetes, AWS, Docker, Terraform