Site Reliability Engineer - Machine Learning Systems
As a Site Reliability Engineer focusing on Machine Learning Systems at ByteDance, you will be responsible for ensuring the efficient and stable operation of large-scale ML systems for model deployment, training, and inference. This role involves building and maintaining distributed ML infrastructure, optimizing resource utilization, and participating in global on-call support for critical systems.
The ByteDance Large Model Team is dedicated to advancing large AI model technology and aims to be a world-class research team. The team works on NLP, CV, speech, and other areas, drawing on abundant data and computing resources to develop a general large model with multi-modal capabilities. The Machine Learning (ML) System sub-team combines systems engineering and machine learning to build and maintain massively distributed ML training and inference systems and services worldwide, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. In this role, you will help build a large-scale heterogeneous system that integrates GPU, NPU, RDMA, and storage, deepen your expertise in coding, performance analysis, and distributed systems, and take part in decision-making within a global team.
Posted June 10, 2026