Remote / Onsite
Senior Site Reliability Engineer - LSEG (London Stock Exchange Group)
The Senior Site Reliability Engineer designs, operates, and improves highly available cloud services on AWS and Azure, and drives automation, monitoring, and performance optimization for critical shared platforms.
About the role
Key Responsibilities
- Own the reliability, performance, and scalability of mission‑critical cloud services, defining and tracking Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
- Design, implement, and maintain automated deployment pipelines and infrastructure‑as‑code using Terraform, CI/CD tools, and container orchestration (Kubernetes).
- Develop monitoring, alerting, and incident‑response frameworks across AWS and Azure environments, ensuring rapid detection and resolution of issues.
- Collaborate with development and product teams to embed reliability best practices into the software development lifecycle.
- Lead continuous‑improvement initiatives, conducting post‑mortems, capacity planning, and performance tuning to reduce downtime and latency.
Requirements
- 5+ years of experience in site reliability or production engineering, with deep hands‑on expertise in AWS and Azure services.
- Proficiency in infrastructure‑as‑code (Terraform, CloudFormation) and container orchestration (Kubernetes, Docker).
- Strong scripting/programming skills, preferably in Python, for automation and tooling.
- Experience building robust monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, CloudWatch, Azure Monitor).
- Solid understanding of networking, security, and high‑availability architectures in cloud environments.
Skills
AWS, Azure, Kubernetes, Terraform, Python