onsite
AI Reliability Engineer (SRE) - GenAI & LLM Infrastructure - DMS Vision Inc
AI Engineer
About the role
Lead the design, deployment, and operation of highly available GenAI and LLM infrastructure, ensuring reliability, scalability, and observability across Kubernetes clusters and cloud environments.
Key Responsibilities
- Architect and maintain resilient Kubernetes-based platforms for GenAI and LLM workloads.
- Implement end-to-end observability, monitoring, and alerting to detect and remediate incidents.
- Collaborate with data science and ML teams to optimize model deployment pipelines.
- Automate infrastructure provisioning and configuration using IaC tools.
- Ensure security, compliance, and cost-efficiency across cloud environments.
Requirements
- 5+ years of SRE or DevOps experience with Kubernetes at scale.
- Deep knowledge of AI/ML infrastructure, including LLM and GenAI workloads.
- Proficiency in observability stacks (Prometheus, Grafana, Loki, etc.).
- Hands-on experience with cloud platforms (AWS, GCP, or Azure) and IaC (Terraform, Pulumi).
- Strong scripting skills (Python, Bash) and CI/CD pipeline expertise.