remote

Principal Site Reliability Engineer - AI Infrastructure Operations - nSCALE

Site Reliability Engineer

Lead AI infrastructure operations as a Principal Site Reliability Engineer, designing and scaling GPU‑cloud services on Kubernetes, automating deployments with Terraform and CI/CD, and ensuring high availability and observability using Prometheus, Grafana, and Python scripting.

About the role

Key Responsibilities

Architect and maintain highly available GPU‑cloud infrastructure for AI workloads on Kubernetes, ensuring scalability, performance, and cost efficiency.
Design and implement CI/CD pipelines, leveraging Terraform, GitOps, and container registries to automate deployment and configuration management.
Develop and maintain observability solutions with Prometheus, Grafana, and custom alerting to detect and remediate incidents proactively.
Collaborate with data science and ML teams to optimize GPU utilization, job scheduling, and resource allocation.
Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve reliability and resilience.

Requirements

10+ years of experience in site reliability engineering, with a strong focus on cloud-native GPU workloads.
Proficiency in Kubernetes, Docker, and container orchestration at scale.
Hands‑on experience with AWS services (EKS, EC2, S3, CloudWatch) and IaC tools such as Terraform.
Strong scripting skills in Python (or Go) for automation and tooling.
Deep knowledge of monitoring, alerting, and incident management using Prometheus, Grafana, and related tools.

Skills

Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, Python

Company: nSCALE

Department: Operations

Location: San Francisco, CA, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 19, 2026