Remote, Onsite

Associate Director - SRE & Observability Engineer AI Infrastructure - Deloitte

Site Reliability Engineer

Lead the design and scaling of reliable, high‑performance AI/GenAI platforms, driving SRE principles, observability, and automation across cloud environments to ensure availability, scalability, and cost efficiency.

About the role

Key Responsibilities

Architect and implement SRE frameworks for AI/GenAI workloads, including LLM training, inference, vector databases, and data pipelines.
Design end‑to‑end observability solutions—metrics, logs, traces—to provide real‑time insight into system health and performance.
Drive automation of deployment, scaling, and incident response using CI/CD pipelines and infrastructure‑as‑code.
Collaborate with data science, security, and platform teams to embed reliability best practices across the AI stack.
Lead incident management, post‑mortem analysis, and continuous improvement initiatives to reduce MTTR and prevent recurrence.

Requirements

10+ years of experience in SRE, DevOps, or reliability engineering, with a strong focus on AI or data‑intensive systems.
Proficiency with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
Hands‑on expertise in observability tools (Prometheus, Grafana, Jaeger, ELK) and automation frameworks (Terraform, Ansible, GitOps).
Excellent problem‑solving skills and the ability to work in a fast‑paced, cross‑functional environment.
Strong communication and leadership skills, with experience mentoring technical teams.

Skills

MLOps, LLM, RAG, Python, AWS, GCP, Azure, Kubernetes

Company: Deloitte

Department: Engineering

Location: Karnataka, India

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026