hybrid

Platform Engineer (AI/LLM Infrastructure)

The Platform Engineer (AI/LLM Infrastructure) leads the design, implementation, and operation of scalable infrastructure platforms for enterprise clients' AI/LLM solutions. The role is a hands-on technical lead who owns the end-to-end infrastructure architecture and partners with clients to deliver robust AI infrastructure, including Kubernetes environments, RAG pipelines, and GPU infrastructure.

About the Role

As a Platform Engineer specializing in AI/LLM Infrastructure, you will play a critical role in designing, implementing, and operating scalable infrastructure platforms for AI/LLM-based solutions for enterprise clients. This is a hands-on technical lead position where you will contribute to development while guiding a team of engineers. You will own the end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.

Day to Day Job Duties

Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients.
Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers.
Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security.
Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions.
Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC.
Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management.
Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar).
Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux).
Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps.
Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch.
Lead incident response, on-call processes, and post-mortem analysis.
Ensure strong security posture and lead InfoSec review processes.
Coordinate delivery across multiple teams and client engagements.

Basic Qualifications

5–8 years of experience in Platform Engineering, SRE, or Infrastructure Engineering.
3+ years of proven experience delivering and leading infrastructure for AI/LLM-based production systems.
Strong hands-on expertise in Kubernetes, Docker, and Helm.
3+ years of experience with Terraform and GitOps (ArgoCD/Flux).
3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines).
3+ years of experience leading client-facing technical engagements.
3+ years of experience managing multiple concurrent projects or teams.
3+ years of hands-on experience with incident management and SLA-driven environments.
3+ years of experience leading security/InfoSec reviews.
Strong understanding of vector databases, RAG pipelines, and LLM inference systems.
3+ years of experience with CI/CD and container registry management.
Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.

Nice to Have (But Not Required)

Experience with AWS in addition to Azure.
Familiarity with Azure API Management and AKS.
Experience with Pulumi (Python/TypeScript).
Knowledge of NVIDIA NIM deployment and lifecycle management.
Python scripting for infrastructure automation.
Experience with load testing tools (k6, Locust, JMeter).
Exposure to FinOps and cost optimization practices.
