Remote
AI Infrastructure Engineer - Zoom Communications
DevOps Engineer
Lead the design and scaling of training pipelines for LLMs, using Python, PyTorch, and Kubernetes on AWS to deliver high‑performance, cost‑efficient AI infrastructure.
About the role
Key Responsibilities
- Architect, build, and maintain end‑to‑end training pipelines for large language models, ensuring reliability and scalability.
- Optimize GPU utilization and memory management across distributed clusters, reducing training time and cost.
- Integrate CI/CD workflows with Docker and Kubernetes to automate model training, validation, and deployment.
- Collaborate with data scientists and ML engineers to translate research prototypes into production‑ready systems.
- Monitor system performance, troubleshoot bottlenecks, and implement proactive capacity planning.
Requirements
- 5+ years of experience in AI infrastructure or large‑scale distributed systems.
- Proficiency in Python, PyTorch, and TensorFlow, with hands‑on experience in distributed training frameworks.
- Strong background in Kubernetes, Docker, and cloud services (AWS, GCP, or Azure).
- Deep understanding of GPU architecture, CUDA, and performance tuning.
- Excellent problem‑solving skills and a track record of delivering production‑grade AI solutions.
Skills
Python, PyTorch, TensorFlow, Kubernetes, Docker, AWS