Remote
Development Architect - AI Foundation Model Training and Serving - SAP
Software Engineer
Lead the design and implementation of training pipelines and serving infrastructure for AI foundation models, using Python, PyTorch/TensorFlow, Kubernetes, Docker, and cloud services to deliver scalable, production‑ready AI solutions.
About the role
Key Responsibilities
- Architect and build end‑to‑end AI foundation model training pipelines, ensuring reproducibility, scalability, and performance.
- Design and maintain model serving infrastructure on Kubernetes, Docker, and cloud platforms (AWS/GCP), enabling low‑latency inference.
- Collaborate with data scientists, ML engineers, and DevOps teams to integrate best practices in MLOps, CI/CD, and monitoring.
- Optimize resource utilization and cost across training and inference workloads, applying advanced scheduling and autoscaling techniques.
- Document architecture, design decisions, and operational procedures for cross‑functional teams.
Requirements
- 5+ years of experience in AI/ML engineering with a focus on large‑scale model training and serving.
Skills
Python, PyTorch, TensorFlow, Kubernetes, Docker, AWS, MLOps