Remote
Senior Lead Site Reliability Engineer - AI/ML and Data Platforms - JPMorganChase
Site Reliability Engineer
Lead the design and operation of highly available AI/ML and data platform services, driving non‑functional requirements, monitoring, and incident response across large‑scale data lake ecosystems using Kubernetes, Docker, CI/CD pipelines, and AWS infrastructure.
About the role
Key Responsibilities
- Define and enforce non‑functional requirements and availability targets for AI/ML and data platform services.
- Embed reliability and performance metrics into product design, testing, and deployment pipelines.
- Design and maintain scalable, highly available Kubernetes clusters and Docker-based workloads.
- Implement and manage CI/CD pipelines, automated testing, and release processes.
- Develop and maintain monitoring, alerting, and incident response frameworks using Prometheus, Grafana, and cloud-native tools.
- Collaborate with data scientists, platform engineers, and security teams to ensure secure, compliant, and resilient data pipelines.
Requirements
- 10+ years of experience in site reliability engineering or related roles.
- Deep expertise in Kubernetes, Docker, and cloud-native infrastructure (AWS preferred).
- Proven track record of designing and operating large‑scale, highly available data platforms.
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (Jenkins, GitHub Actions, Argo CD).
- Excellent communication and leadership skills, with the ability to mentor and guide cross‑functional teams.
Skills
Kubernetes, Docker, CI/CD, AWS