onsite

GPU Infrastructure Engineer, AI Platform

This role involves designing and operating GPU infrastructure for AI model hosting, ensuring efficient provisioning, scheduling, and cost optimization. The engineer will build and scale model serving systems, implement multi-model routing, and own the end-to-end model lifecycle while driving inference optimization and building self-service infrastructure platforms.

About the role

Responsibilities

Design and operate GPU infrastructure for model hosting, including provisioning, scheduling, and cost optimization across cloud and on-premises environments
Build and scale model serving systems using vLLM, TensorRT-LLM, Triton, or equivalent, supporting real-time inference with strong latency and availability guarantees
Implement multi-model routing to serve multiple models across modalities (text, voice, code, vision) on shared infrastructure
Own the model lifecycle end to end: download, deploy, serve, monitor, swap, and scale
Drive inference optimization including quantization strategies (AWQ, GPTQ), batching, caching, and cold start reduction
Build self-service infrastructure platforms where teams provision compute, storage, and model endpoints through APIs and control planes
Implement infrastructure-as-code at scale using Terraform, Pulumi, or CDK
Build observability and reliability for inference systems: SLIs/SLOs, GPU utilization monitoring, latency tracking, automated capacity planning, and alerting
Define platform standards and governance including multi-tenant isolation, cost attribution, and resource quotas
Lead architectural design and influence engineering direction across the AI infrastructure stack

Skills

vLLM, TensorRT-LLM, Triton, AWQ, GPTQ, Terraform, Pulumi, CDK, GPU, APIs

Department: Engineering

Location: San Francisco, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 2, 2026