Onsite

AI Infrastructure Engineer - Zoom

DevOps Engineer

Lead the design and scaling of large‑scale training infrastructure for LLMs, leveraging Python, PyTorch, TensorFlow, Kubernetes, Docker, and AWS to deliver high‑performance, distributed AI systems.

About the role

Key Responsibilities

Architect and maintain scalable GPU cluster environments for large‑scale LLM training.
Implement and optimize distributed training pipelines using PyTorch and TensorFlow.
Automate deployment and scaling with Kubernetes, Docker, and CI/CD workflows.
Collaborate with data scientists to tune model training performance and resource utilization.
Monitor system health, troubleshoot bottlenecks, and implement cost‑effective solutions on AWS.

Requirements

5+ years of experience building AI infrastructure for large‑scale models.
Proficiency in Python, PyTorch, TensorFlow, and distributed training techniques.
Hands‑on experience with Kubernetes, Docker, and cloud services (AWS, GCP, or Azure).
Strong understanding of GPU cluster management, networking, and performance tuning.
Excellent problem‑solving skills and ability to work cross‑functionally in a fast‑paced environment.

Skills

Python, PyTorch, TensorFlow, Kubernetes, Docker, AWS

Company: Zoom

Department: Engineering

Location: Seattle, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 24, 2026