The Principal Engineer, AI Infrastructure will design, build, and scale GPU-accelerated AI infrastructure to enable real-time inference for various AI models. This role involves end-to-end ownership of the model lifecycle, from deployment and serving to optimization and monitoring, while leading architectural design and influencing engineering direction.
About the role
We are seeking a highly experienced, deeply technical Principal Engineer, AI Infrastructure to design, build, and scale our GPU-accelerated AI infrastructure. You will enable real-time inference for AI models across modalities and own the model lifecycle from deployment through optimization.
Responsibilities
Design and operate GPU infrastructure for model hosting, including provisioning, scheduling, and cost optimization across cloud and on-premise environments.
Build and scale model serving systems using vLLM, TensorRT-LLM, Triton Inference Server, or equivalent, supporting real-time inference with strong latency and availability guarantees.
Implement multi-model routing to serve multiple models across modalities (text, voice, code, vision) on shared infrastructure.
Own the model lifecycle end to end: download, deploy, serve, monitor, swap, and scale.
Drive inference optimization, including quantization strategies (AWQ, GPTQ), batching, caching, and cold-start reduction.
Build self-service infrastructure platforms where teams provision compute, storage, and model endpoints through APIs and control planes.
Implement infrastructure-as-code at scale using Terraform, Pulumi, or CDK.
Build observability and reliability for inference systems: SLIs/SLOs, GPU utilization monitoring, latency tracking, automated capacity planning, and alerting.
Define platform standards and governance including multi-tenant isolation, cost attribution, and resource quotas.
Lead architectural design and influence engineering direction across the AI infrastructure stack.