remote

Senior Lead Site Reliability Engineer - AI/ML and Data Platforms - JPMorganChase

Site Reliability Engineer

Lead the design and operation of highly available AI/ML and data platform services, driving non‑functional requirements, monitoring, and incident response across large‑scale data lake ecosystems using Kubernetes, Docker, CI/CD pipelines, and AWS infrastructure.

About the role

Key Responsibilities

Define and enforce non‑functional requirements and availability targets for AI/ML and data platform services.
Embed reliability and performance metrics into product design, testing, and deployment pipelines.
Design and maintain scalable, highly available Kubernetes clusters and Docker-based workloads.
Implement and manage CI/CD pipelines, automated testing, and release processes.
Develop and maintain monitoring, alerting, and incident response frameworks using Prometheus, Grafana, and cloud-native tools.
Collaborate with data scientists, platform engineers, and security teams to ensure secure, compliant, and resilient data pipelines.

Requirements

10+ years of experience in site reliability engineering or related roles.
Deep expertise in Kubernetes, Docker, and cloud-native infrastructure (AWS preferred).
Proven track record of designing and operating large‑scale, highly available data platforms.
Strong scripting skills (Python, Bash) and experience with CI/CD tools (Jenkins, GitHub Actions, ArgoCD).
Excellent communication and leadership skills, with the ability to mentor and guide cross‑functional teams.

Skills

Kubernetes, Docker, CI/CD, AWS

Company: JPMorganChase

Department: Engineering

Location: Jersey City, NJ, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 19, 2026