On-site
Senior ML Infrastructure Engineer - Finoit Inc.
DevOps Engineer
Lead the design and scaling of high‑performance GPU training platforms, optimizing distributed PyTorch pipelines on Kubernetes to empower ML researchers with faster, more reliable model training.
About the role
Key Responsibilities
- Architect and maintain scalable GPU clusters for large‑scale PyTorch training workloads.
- Design and optimize distributed training pipelines, leveraging Kubernetes and container orchestration.
- Implement CI/CD workflows for model training, monitoring, and deployment.
- Collaborate with ML researchers to improve developer experience and reduce training time.
- Monitor system performance, troubleshoot bottlenecks, and drive continuous improvement.
Requirements
- 5+ years of experience building ML infrastructure, with deep knowledge of PyTorch and distributed training.
- Proficiency in Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
- Strong scripting skills (Python, Bash) and familiarity with CI/CD tools.
- Experience with GPU cluster management, performance tuning, and cost optimization.
- Excellent problem‑solving skills and a collaborative mindset.
Skills
PyTorch, Kubernetes, AWS