hybrid
Platform Engineer (AI/LLM Infrastructure)
The Platform Engineer (AI/LLM Infrastructure) will lead the design, implementation, and operation of scalable infrastructure platforms that support AI/LLM solutions for enterprise clients. The role is a hands-on technical lead position: you will own end-to-end infrastructure architecture, manage Kubernetes environments, RAG pipelines, and GPU infrastructure, and partner with clients to deliver robust AI infrastructure solutions.
About the Role
As a Platform Engineer specializing in AI/LLM Infrastructure, you will play a critical role in designing, implementing, and operating scalable infrastructure platforms for AI/LLM-based solutions for enterprise clients. This is a hands-on technical lead position where you will contribute to development while guiding a team of engineers. You will own the end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.
Day-to-Day Job Duties
- Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients.
- Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers.
- Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.
- Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions.
- Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC.
- Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management.
- Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar).
- Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux).
- Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps.
- Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch.
- Lead incident response, on-call processes, and post-mortem analysis.
- Ensure strong security posture and lead InfoSec review processes.
- Coordinate delivery across multiple teams and client engagements.
Basic Qualifications
- 5–8 years of experience in Platform Engineering, SRE, or Infrastructure Engineering.
- 3+ years of proven experience delivering and leading infrastructure for AI/LLM-based production systems.
- Strong hands-on expertise in Kubernetes, Docker, and Helm.
- 3+ years of experience with Terraform and GitOps (ArgoCD/Flux).
- 3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines).
- 3+ years of experience leading client-facing technical engagements.
- 3+ years of experience managing multiple concurrent projects or teams.
- 3+ years of hands-on experience with incident management and SLA-driven environments.
- 3+ years of experience leading security/InfoSec reviews.
- Strong understanding of vector databases, RAG pipelines, and LLM inference systems.
- 3+ years of experience with CI/CD and container registry management.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
Nice to Have (But Not Required)
- Experience with AWS in addition to Azure.
- Familiarity with Azure API Management and AKS.
- Experience with Pulumi (Python/TypeScript).
- Knowledge of NVIDIA NIM deployment and lifecycle management.
- Python scripting for infrastructure automation.
- Experience with load testing tools (k6, Locust, JMeter).
- Exposure to FinOps and cost optimization practices.