Remote / Onsite
Site Reliability Engineer III - JPMorganChase
Senior Site Reliability Engineer role focused on building and optimizing cloud-native infrastructure for AI/ML and data platforms with Python, AWS, Kubernetes, Terraform, Docker, and CI/CD pipelines. The role ensures high availability and performance through monitoring and automation.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, highly available infrastructure for AI/ML and data platform services on AWS.
- Develop and manage Kubernetes clusters, Helm charts, and Terraform modules to automate deployment and configuration.
- Build and maintain CI/CD pipelines using GitHub Actions, Jenkins, or similar tools to streamline code delivery.
- Implement monitoring, alerting, and logging solutions with Prometheus, Grafana, and the ELK stack to ensure system reliability.
- Collaborate with data scientists, developers, and security teams to optimize performance, cost, and compliance.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles.
- Proficiency in Python scripting and automation.
- Hands‑on experience with AWS services (EKS, EC2, S3, RDS, CloudWatch).
- Strong knowledge of Kubernetes, Helm, and Terraform for infrastructure as code.
- Experience with CI/CD, containerization (Docker), and monitoring tools (Prometheus, Grafana).
Skills
Python, AWS, Kubernetes, Terraform, Docker, CI/CD, Prometheus