Remote
Principal Site Reliability Engineer - AI Infrastructure Operations - nSCALE
Site Reliability Engineer
Lead AI infrastructure operations as a Principal Site Reliability Engineer: design and scale GPU‑cloud services on Kubernetes, automate deployments with Terraform and CI/CD, and maintain high availability and observability with Prometheus, Grafana, and Python tooling.
About the role
Key Responsibilities
- Architect and maintain highly available GPU‑cloud infrastructure for AI workloads on Kubernetes, ensuring scalability, performance, and cost efficiency.
- Design and implement CI/CD pipelines, leveraging Terraform, GitOps, and container registries to automate deployment and configuration management.
- Develop and maintain observability solutions with Prometheus, Grafana, and custom alerting to detect and remediate incidents proactively.
- Collaborate with data science and ML teams to optimize GPU utilization, job scheduling, and resource allocation.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve reliability and resilience.
Requirements
- 10+ years of experience in site reliability engineering, with a strong focus on cloud-native GPU workloads.
- Proficiency in Kubernetes, Docker, and container orchestration at scale.
- Hands‑on experience with AWS services (EKS, EC2, S3, CloudWatch) and IaC tools such as Terraform.
- Strong scripting skills in Python (or Go) for automation and tooling.
- Deep knowledge of monitoring, alerting, and incident management using Prometheus, Grafana, and related tools.
Skills
kubernetes, docker, aws, terraform, prometheus, grafana, python