onsite
Site Reliability Engineer (Observability) - NFU Mutual
A Site Reliability Engineer role focused on building end‑to‑end observability and improving the reliability and performance of services using Prometheus, Grafana, Python, Kubernetes and cloud platforms.
About the role
Key Responsibilities
- Design, implement and maintain observability pipelines that provide real‑time metrics, traces and logs across all production services.
- Develop and extend monitoring solutions using Prometheus, Grafana and custom Python exporters.
- Collaborate with development and operations teams to embed reliability best practices into CI/CD workflows.
- Automate incident detection, alerting and post‑mortem processes to reduce mean time to resolution.
- Drive continuous improvement of infrastructure reliability on cloud platforms such as AWS, using Kubernetes and infrastructure‑as‑code (IaC) tools.
Requirements
- Strong experience with observability stacks (Prometheus, Grafana, OpenTelemetry or similar).
- Proficiency in Python for scripting, automation and building exporters.
- Hands‑on experience with container orchestration (Kubernetes) and cloud environments (AWS).
- Solid understanding of SRE principles, incident management and performance tuning.
- Ability to work collaboratively in a hybrid team, communicating complex technical concepts clearly.
Skills
Prometheus, Grafana, Python, Kubernetes, AWS