remote

Site Reliability Engineer - NTT DATA

The Site Reliability Engineer drives 24/7 uptime for mission‑critical UPS.com services, designs scalable GCP/Azure infrastructure, manages Kubernetes clusters, and builds CI/CD pipelines with Terraform, Argo CD, and GitOps practices.

About the role

Key Responsibilities

Operate and maintain a 24/7 on‑call rotation for mission‑critical UPS.com applications.
Design, deploy, and scale highly available cloud infrastructure across GCP and Azure.
Implement and enforce SLOs, SLIs, error budgets, and incident response processes.
Build and evolve internal developer platforms to enable self‑service and accelerate delivery.
Manage Kubernetes environments (GKE and OpenShift), including operators and service mesh integration.
Develop Infrastructure as Code with Terraform and Config Connector.
Create CI/CD pipelines and GitOps workflows using Argo CD and Azure Pipelines.
Enhance observability, monitoring, and alerting to reduce mean time to recovery.

Requirements

Proven experience with GCP and Azure cloud services.
Strong background in Kubernetes administration and service mesh.
Hands‑on expertise with Terraform, Config Connector, Argo CD, and Azure Pipelines.
Deep understanding of SLO/SLI concepts and incident management.
Excellent scripting skills (Python, Bash) and ability to automate operational tasks.

Skills

Kubernetes, Terraform

Company: NTT DATA

Department: Engineering

Location: Toronto, CA, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026