remote

Site Reliability Engineering (SRE) / Observability Technical Lead - NTT DATA

Engineering Manager

Lead SRE and observability initiatives, designing and operating reliable, scalable cloud platforms using Kubernetes, Terraform, and AWS while driving monitoring, alerting, and automation with Prometheus, Grafana, Python, and Go.

About the role

Key Responsibilities

Design, implement, and maintain highly available, scalable infrastructure on AWS using Terraform and Kubernetes.
Lead the observability strategy: build end‑to‑end monitoring, logging, and tracing pipelines with Prometheus, Grafana, and related tools.
Develop automation scripts and services in Python and Go to improve incident response, remediation, and reliability.
Collaborate with development and product teams to embed SRE best practices into the software delivery lifecycle.
Mentor junior engineers, conduct post‑mortems, and drive continuous improvement of reliability processes.

Requirements

5+ years of experience in site reliability engineering or related roles.
Strong hands‑on expertise with Kubernetes, Terraform, and AWS cloud services.
Proficiency in building observability solutions using Prometheus, Grafana, and related ecosystems.
Solid programming/scripting skills in Python and Go.
Demonstrated ability to lead technical initiatives, mentor teams, and communicate complex concepts clearly.

Skills

Kubernetes, Prometheus, Grafana, Terraform, AWS, Python, Go

Company: NTT DATA

Department: Engineering

Location: London, United Kingdom

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 26, 2026