Remote / Onsite
Associate Director - SRE & Observability Engineer AI Infrastructure - Deloitte
Site Reliability Engineer
Lead the design and scaling of reliable, high‑performance AI/GenAI platforms, driving SRE principles, observability, and automation across cloud environments to ensure availability, scalability, and cost efficiency.
About the role
Key Responsibilities
- Architect and implement SRE frameworks for AI/GenAI workloads, including LLM training, inference, vector databases, and data pipelines.
- Design end‑to‑end observability solutions—metrics, logs, traces—to provide real‑time insight into system health and performance.
- Drive automation of deployment, scaling, and incident response using CI/CD pipelines and infrastructure‑as‑code.
- Collaborate with data science, security, and platform teams to embed reliability best practices across the AI stack.
- Lead incident management, post‑mortem analysis, and continuous improvement initiatives to reduce mean time to recovery (MTTR) and prevent recurrence.
Requirements
- 10+ years of experience in SRE, DevOps, or reliability engineering, with a strong focus on AI or data‑intensive systems.
- Proficiency with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).
- Hands‑on expertise in observability tools (Prometheus, Grafana, Jaeger, ELK) and automation frameworks (Terraform, Ansible, GitOps).
- Excellent problem‑solving skills and the ability to work in a fast‑paced, cross‑functional environment.
- Strong communication and leadership skills, with experience mentoring technical teams.
Skills
MLOps, LLM, RAG, Python, AWS, GCP, Azure, Kubernetes