onsite

Site Reliability Engineer, Observability - NFU Mutual

A Site Reliability Engineer role focused on building end‑to‑end observability and improving the reliability and performance of services using Prometheus, Grafana, Python, Kubernetes and cloud platforms.

About the role

Key Responsibilities

Design, implement and maintain observability pipelines that provide real‑time metrics, traces and logs across all production services.
Develop and extend monitoring solutions using Prometheus, Grafana and custom Python exporters.
Collaborate with development and operations teams to embed reliability best practices into CI/CD workflows.
Automate incident detection, alerting and post‑mortem processes to reduce mean time to resolution.
Drive continuous improvement of infrastructure reliability on cloud platforms such as AWS, leveraging Kubernetes and infrastructure-as-code (IaC) tools.

Requirements

Strong experience with observability stacks (Prometheus, Grafana, OpenTelemetry or similar).
Proficiency in Python for scripting, automation and building exporters.
Hands‑on experience with container orchestration (Kubernetes) and cloud environments (AWS).
Solid understanding of SRE principles, incident management and performance tuning.
Ability to work collaboratively in a hybrid team, communicating complex technical concepts clearly.

Skills

Prometheus, Grafana, Python, Kubernetes, AWS

Company: NFU Mutual

Department: Engineering

Location: Stratford-upon-Avon, United Kingdom

Experience: 3+ years

Tenure: full-time

Level: Mid-Level

Salary: 55,000

Posted June 26, 2026