remote

AI Infrastructure Engineer - Zoom Communications

DevOps Engineer

Lead the design and scaling of large‑scale training pipelines for LLMs, leveraging Python, PyTorch, and Kubernetes on AWS to deliver high‑performance, cost‑efficient AI infrastructure.

About the role

Key Responsibilities

Architect, build, and maintain end‑to‑end training pipelines for large language models, ensuring reliability and scalability.
Optimize GPU utilization and memory management across distributed clusters, reducing training time and cost.
Integrate CI/CD workflows with Docker and Kubernetes to automate model training, validation, and deployment.
Collaborate with data scientists and ML engineers to translate research prototypes into production‑ready systems.
Monitor system performance, troubleshoot bottlenecks, and implement proactive capacity planning.

Requirements

5+ years of experience in AI infrastructure or large‑scale distributed systems.
Proficiency in Python, PyTorch, and TensorFlow, with hands‑on experience in distributed training frameworks.
Strong background in Kubernetes, Docker, and cloud services (AWS, GCP, or Azure).
Deep understanding of GPU architecture, CUDA, and performance tuning.
Excellent problem‑solving skills and a track record of delivering production‑grade AI solutions.

Skills

Python, PyTorch, TensorFlow, Kubernetes, Docker, AWS

Company: Zoom Communications

Department: Engineering

Location: Seattle, Washington, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Salary: 332,200

Posted June 22, 2026