onsite

Infrastructure Site Reliability Engineer - HPC - Sarvam


This Site Reliability Engineer role focuses on high‑performance computing infrastructure: managing a multi‑vendor GPU fleet for large‑scale AI training workloads using Kubernetes, Terraform, and monitoring tools.

About the role

Key Responsibilities

Design, deploy, and operate a scalable GPU‑powered HPC platform supporting continuous AI training workloads.
Implement and maintain Kubernetes clusters across heterogeneous GPU hardware, ensuring high availability and performance.
Automate infrastructure provisioning and configuration using Terraform and Ansible, reducing manual intervention.
Develop monitoring, alerting, and observability solutions with Prometheus and Grafana to proactively detect and resolve issues.
Collaborate with AI research and engineering teams to optimize resource utilization and troubleshoot job failures.
Establish and enforce SRE best practices, including incident response, post‑mortem analysis, and capacity planning.

Requirements

3+ years of experience in SRE or DevOps roles, preferably in HPC or GPU‑intensive environments.
Strong proficiency in Linux system administration and scripting (Python or Bash).
Hands‑on experience with Kubernetes orchestration, GPU drivers, and container runtimes.
Expertise in infrastructure‑as‑code tools such as Terraform and configuration management with Ansible.
Solid understanding of monitoring stacks (Prometheus, Grafana) and incident management processes.

Skills

Kubernetes, Linux, Terraform, Prometheus, Ansible, Python

Company: Sarvam

Department: Engineering

Location: Karnataka, India

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 26, 2026