Remote
Site Reliability Engineering (SRE) / Observability Technical Lead - NTT DATA
Engineering Manager
Lead SRE and observability initiatives, designing and operating reliable, scalable cloud platforms using Kubernetes, Terraform, and AWS while driving monitoring, alerting, and automation with Prometheus, Grafana, Python, and Go.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS using Terraform and Kubernetes.
- Lead the observability strategy: build end‑to‑end monitoring, logging, and tracing pipelines with Prometheus, Grafana, and related tools.
- Develop automation scripts and services in Python and Go to improve incident response, remediation, and reliability.
- Collaborate with development and product teams to embed SRE best practices into the software delivery lifecycle.
- Mentor junior engineers, conduct post‑mortems, and drive continuous improvement of reliability processes.
Requirements
- 5+ years of experience in site reliability engineering or related roles.
- Strong hands‑on expertise with Kubernetes, Terraform, and AWS cloud services.
- Proficiency in building observability solutions using Prometheus, Grafana, and related ecosystems.
- Solid programming/scripting skills in Python and Go.
- Demonstrated ability to lead technical initiatives, mentor teams, and communicate complex concepts clearly.
Skills
Kubernetes, Prometheus, Grafana, Terraform, AWS, Python, Go