onsite
Software Engineer - AI Infrastructure / Training / Inference
As a senior software engineer, you will build scalable, high‑performance training and inference pipelines for multimodal AI, with a focus on GPU orchestration, distributed infrastructure, and cost‑efficient, reliable production systems.
About the role
Key Responsibilities
- Design and implement scalable model serving and inference pipelines for multimodal AI workloads.
- Build and maintain distributed GPU infrastructure to support large‑scale training and inference.
- Optimize performance and cost across compute, storage, and networking layers.
- Develop observability, monitoring, and alerting solutions to ensure reliability and rapid incident response.
- Collaborate with applied scientists to create developer platforms that accelerate experimentation while preserving production quality.
Requirements
- 5+ years of software engineering experience in high‑performance, distributed systems.
- Strong proficiency in Python and experience with GPU‑accelerated frameworks (e.g., PyTorch, TensorFlow).
- Hands‑on knowledge of container orchestration (Kubernetes) and GPU scheduling.
- Deep understanding of performance profiling tools and methods, and of cost‑optimization techniques.
- Experience with observability stacks (Prometheus, Grafana, OpenTelemetry) and incident‑driven reliability practices.