Remote
Principal Engineer, AI Platform & Infrastructure
Software Engineer
Lead the design and scaling of AI platform infrastructure, building robust deployment pipelines, GPU clusters, and observability tools to bring multimodal AI models from research to production for real‑time virtual try‑on experiences.
About the role
Key Responsibilities
- Architect and maintain an end‑to‑end ML platform, including data ingestion, model training, and serving pipelines.
- Design and operate GPU‑enabled Kubernetes clusters for large‑scale multimodal inference.
- Implement CI/CD workflows and containerization strategies (Docker, Helm) for rapid model deployment.
- Build an observability stack (metrics, logs, tracing) to monitor model performance and system health.
- Collaborate with research, product, and operations teams to translate prototypes into production‑grade services.
Requirements
- 10+ years of software engineering experience with a focus on ML infrastructure.
- Deep expertise in Python, Kubernetes, and GPU cluster management.
- Proven track record of building scalable CI/CD pipelines and observability solutions.
- Strong understanding of cloud platforms (AWS, GCP, or Azure) and container orchestration.
- Excellent communication skills and the ability to mentor junior engineers.
Skills
Python, Kubernetes, CI/CD, Docker