On-site
AI Infrastructure Engineer - Zoom
DevOps Engineer
Lead the design and scaling of LLM training infrastructure, using Python, PyTorch, TensorFlow, Kubernetes, Docker, and AWS to deliver high‑performance distributed AI systems.
About the role
Key Responsibilities
- Architect and maintain scalable GPU cluster environments for large‑scale LLM training.
- Implement and optimize distributed training pipelines using PyTorch and TensorFlow.
- Automate deployment and scaling with Kubernetes, Docker, and CI/CD workflows.
- Collaborate with data scientists to tune model training performance and resource utilization.
- Monitor system health, troubleshoot bottlenecks, and implement cost‑effective solutions on AWS.
Requirements
- 5+ years of experience building AI infrastructure for large‑scale models.
- Proficiency in Python, PyTorch, TensorFlow, and distributed training techniques.
- Hands‑on experience with Kubernetes, Docker, and cloud services (AWS, GCP, or Azure).
- Strong understanding of GPU cluster management, networking, and performance tuning.
- Excellent problem‑solving skills and the ability to work cross‑functionally in a fast‑paced environment.
Skills
Python, PyTorch, TensorFlow, Kubernetes, Docker, AWS