Remote
Site Reliability Engineer (SRE) - AI Platforms - HSBC
Drive reliability and scalability for AI platform services, leveraging Kubernetes, Docker, and observability tools while automating deployments with CI/CD pipelines and Python scripting.
About the role
Key Responsibilities
- Design, implement, and maintain highly available AI platform services on Kubernetes clusters.
- Develop and manage CI/CD pipelines to automate build, test, and deployment processes.
- Implement monitoring, alerting, and logging solutions using Prometheus, Grafana, and the ELK stack.
- Collaborate with data science and ML teams to ensure seamless integration of AI workloads.
- Conduct post‑incident reviews and root cause analysis, and implement preventive measures.
Requirements
- Proven experience as an SRE or DevOps engineer in a cloud environment.
- Strong proficiency with Docker and with container orchestration on Kubernetes.
- Hands‑on experience with monitoring tools (Prometheus, Grafana) and log management.
- Solid scripting skills in Python and familiarity with CI/CD tools (GitLab CI, Jenkins, ArgoCD).
- Excellent problem‑solving abilities and a proactive approach to reliability.
Skills
Kubernetes, Docker, Prometheus, Grafana, Python, CI/CD