Onsite

Senior ML Infrastructure Engineer - Finoit Inc.

DevOps Engineer

Lead the design and scaling of high‑performance GPU training platforms, optimizing distributed PyTorch pipelines on Kubernetes to empower ML researchers with faster, more reliable model training.

About the role

Key Responsibilities

Architect and maintain scalable GPU clusters for large‑scale PyTorch training workloads.
Design and optimize distributed training pipelines, leveraging Kubernetes and container orchestration.
Implement CI/CD workflows for model training, monitoring, and deployment.
Collaborate with ML researchers to improve developer experience and reduce training time.
Monitor system performance, troubleshoot bottlenecks, and drive continuous improvement.

Requirements

5+ years of experience building ML infrastructure, with deep knowledge of PyTorch and distributed training.
Proficiency in Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
Strong scripting skills (Python, Bash) and familiarity with CI/CD tools.
Experience with GPU cluster management, performance tuning, and cost optimization.
Excellent problem‑solving skills and a collaborative mindset.

Skills

PyTorch, Kubernetes, AWS

Company: Finoit Inc.

Department: Engineering

Location: Portola Valley, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 23, 2026