onsite
Senior AI/ML Engineer Postdoc - LLM Infrastructure - Technische Informationsbibliothek
ML Engineer
Lead the design, deployment, and scaling of Large Language Model infrastructure, leveraging Python, deep‑learning frameworks, and cloud/Kubernetes technologies to support cutting‑edge research and services.
About the role
Key Responsibilities
- Architect, implement, and maintain scalable infrastructure for training and serving Large Language Models (LLMs) in a cloud‑native environment.
- Develop robust pipelines in Python, using PyTorch or TensorFlow, for data preprocessing, model fine‑tuning, and evaluation.
- Containerize applications with Docker and orchestrate workloads on Kubernetes clusters, ensuring high availability and resource efficiency.
- Integrate infrastructure with AWS services (e.g., S3, EC2, SageMaker) and on‑premises HPC resources to meet performance and cost targets.
- Collaborate with research scientists to translate experimental prototypes into production‑ready services.
Requirements
- Ph.D. in Computer Science, Machine Learning, or a related field, with a strong publication record in LLMs or deep learning.
- Extensive hands‑on experience with Python and major deep‑learning frameworks (PyTorch, TensorFlow).
- Proven expertise in containerization (Docker) and orchestration (Kubernetes) for AI workloads.
- Solid understanding of cloud platforms, preferably AWS, and experience with large‑scale distributed training.
- Ability to work independently, mentor junior team members, and communicate complex technical concepts effectively.
Skills
python, pytorch, tensorflow, kubernetes, docker, aws