remote
Site Reliability Engineer - NTT DATA
Site Reliability Engineer driving 24/7 uptime for mission‑critical UPS.com services, designing scalable GCP/Azure infrastructure, managing Kubernetes clusters, and building CI/CD pipelines with Terraform, Argo CD, and GitOps practices.
About the role
Key Responsibilities
- Participate in and maintain a 24/7 on‑call rotation for mission‑critical UPS.com applications.
- Design, deploy, and scale highly available cloud infrastructure across GCP and Azure.
- Implement and enforce SLOs, SLIs, error budgets, and incident response processes.
- Build and evolve internal developer platforms to enable self‑service and accelerate delivery.
- Manage Kubernetes environments (GKE and OpenShift), including operators and service mesh integration.
- Develop Infrastructure as Code with Terraform and Config Connector.
- Create CI/CD pipelines and GitOps workflows using Argo CD and Azure Pipelines.
- Enhance observability, monitoring, and alerting to reduce mean time to recovery.
Requirements
- Proven experience with GCP and Azure cloud services.
- Strong background in Kubernetes administration and service mesh.
- Hands‑on expertise with Terraform, Config Connector, Argo CD, and Azure Pipelines.
- Deep understanding of SLO/SLI concepts and incident management.
- Excellent scripting skills (Python, Bash) and the ability to automate operational tasks.
Skills
kubernetes, terraform