Remote

Site Reliability Engineer (SRE) - AI Platforms - HSBC

Drive reliability and scalability for AI platform services, leveraging Kubernetes, Docker, and observability tools while automating deployments with CI/CD pipelines and Python scripting.

About the role

Key Responsibilities

Design, implement, and maintain highly available AI platform services on Kubernetes clusters.
Develop and manage CI/CD pipelines to automate build, test, and deployment processes.
Implement monitoring, alerting, and logging solutions using Prometheus, Grafana, and the ELK stack.
Collaborate with data science and ML teams to ensure seamless integration of AI workloads.
Conduct post‑incident reviews and root cause analysis, and implement preventive measures.

Requirements

Proven experience as an SRE or DevOps engineer in a cloud environment.
Strong proficiency with Kubernetes, Docker, and container orchestration.
Hands‑on experience with monitoring tools (Prometheus, Grafana) and log management.
Solid scripting skills in Python and familiarity with CI/CD tools (GitLab CI, Jenkins, ArgoCD).
Excellent problem‑solving abilities and a proactive approach to reliability.

Skills

Kubernetes, Docker, Prometheus, Grafana, Python, CI/CD

Company: HSBC

Department: Engineering

Location: ENG, United Kingdom

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 24, 2026