onsite
Infrastructure Site Reliability Engineer - HPC - Sarvam
Site Reliability Engineer focused on high‑performance computing infrastructure, managing a multi‑vendor GPU fleet for large‑scale AI training workloads using Kubernetes, Terraform, and monitoring tools.
About the role
Key Responsibilities
- Design, deploy, and operate a scalable GPU‑powered HPC platform supporting continuous AI training workloads.
- Implement and maintain Kubernetes clusters across heterogeneous GPU hardware, ensuring high availability and performance.
- Automate infrastructure provisioning and configuration using Terraform and Ansible, reducing manual intervention.
- Develop monitoring, alerting, and observability solutions with Prometheus and Grafana to proactively detect and resolve issues.
- Collaborate with AI research and engineering teams to optimize resource utilization and troubleshoot job failures.
- Establish and enforce SRE best practices, including incident response, post‑mortem analysis, and capacity planning.
Requirements
- 3+ years of experience in SRE or DevOps roles, preferably in HPC or GPU‑intensive environments.
- Strong proficiency in Linux system administration and scripting (Python or Bash).
- Hands‑on experience with Kubernetes orchestration, GPU drivers, and container runtimes.
- Expertise in infrastructure‑as‑code tools such as Terraform and configuration management with Ansible.
- Solid understanding of monitoring stacks (Prometheus, Grafana) and incident management processes.
Skills
Kubernetes, Linux, Terraform, Prometheus, Ansible, Python