onsite

Machine Learning Infrastructure Engineer

As a Machine Learning Infrastructure Engineer, you will design, operate, and optimize GPU infrastructure for model hosting and serving. This role involves building scalable model serving systems, implementing multi-model routing, and owning the end-to-end model lifecycle, while driving inference optimizations and building self-service platforms.

About the Role

We are seeking a Machine Learning Infrastructure Engineer to design and operate GPU infrastructure and model serving systems. The role carries end-to-end ownership of the model lifecycle, from deployment through monitoring and scaling, and drives inference optimization.

Responsibilities

Design and operate GPU infrastructure for model hosting, including provisioning, scheduling, and cost optimization across cloud and on-premise environments.
Build and scale model serving systems using vLLM, TensorRT-LLM, Triton, or equivalent, supporting real-time inference with strong latency and availability guarantees.
Implement multi-model routing to serve multiple models across modalities (text, voice, code, vision) on shared infrastructure.
Own the model lifecycle end to end: download, deploy, serve, monitor, swap, and scale.
Drive inference optimization, including quantization strategies (AWQ, GPTQ), batching, caching, and cold-start reduction.
Build self-service infrastructure platforms where teams provision compute, storage, and model endpoints through APIs and control planes.
Implement infrastructure-as-code at scale using Terraform, Pulumi, or CDK.
Build observability and reliability for inference systems: SLIs/SLOs, GPU utilization monitoring, latency tracking, automated capacity planning, and alerting.
Define platform standards and governance including multi-tenant isolation, cost attribution, and resource quotas.
Lead architectural design and influence engineering direction across the AI infrastructure stack.

Skills

GPU, vLLM, TensorRT-LLM, Triton, quantization, AWQ, GPTQ, API, Terraform, Pulumi, CDK

Company: Paytm

Department: Engineering

Location: India

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 11, 2026
