onsite

Software Engineer - AI Infrastructure / Training / Inference

Senior software engineer building scalable, high‑performance inference and training pipelines for multimodal AI, focusing on GPU orchestration, distributed infrastructure, and cost‑efficient, reliable production systems.

About the role

Key Responsibilities

Design and implement scalable model serving and inference pipelines for multimodal AI workloads.
Build and maintain distributed GPU infrastructure to support large‑scale training and inference.
Optimize performance and cost across compute, storage, and networking layers.
Develop observability, monitoring, and alerting solutions to ensure reliability and rapid incident response.
Collaborate with applied scientists to create developer platforms that accelerate experimentation while preserving production quality.

Requirements

5+ years of software engineering experience in high‑performance, distributed systems.
Strong proficiency in Python and experience with GPU‑accelerated frameworks (e.g., PyTorch, TensorFlow).
Hands‑on knowledge of container orchestration (Kubernetes) and GPU scheduling.
Deep understanding of performance profiling, the associated tooling, and cost‑optimization techniques.
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident‑driven reliability practices.

Skills

Python

Department: Engineering

Location: San Francisco, California, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026