remote

Principal Engineer, AI Platform & Infrastructure

Software Engineer

Lead the design and scaling of AI platform infrastructure, building robust deployment pipelines, GPU clusters, and observability tools to bring multimodal AI models from research to production for real‑time virtual try‑on experiences.

About the role

Key Responsibilities

Architect and maintain the end‑to‑end ML platform, including data ingestion, model training, and serving pipelines.
Design and operate GPU‑enabled Kubernetes clusters for large‑scale multimodal inference.
Implement CI/CD workflows and containerization strategies (Docker, Helm) for rapid model deployment.
Build an observability stack (metrics, logs, tracing) to monitor model performance and system health.
Collaborate with research, product, and operations teams to translate prototypes into production‑grade services.

Requirements

10+ years of software engineering experience with a focus on ML infrastructure.
Deep expertise in Python, Kubernetes, and GPU cluster management.
Proven track record of building scalable CI/CD pipelines and observability solutions.
Strong understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration.
Excellent communication skills and the ability to mentor junior engineers.

Skills

Python, Kubernetes, CI/CD, Docker

Department: Engineering

Location: San Francisco, California, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Posted June 23, 2026