onsite

Site Reliability Engineer - Machine Learning Systems

As a Site Reliability Engineer focusing on Machine Learning Systems at ByteDance, you will be responsible for ensuring the efficient and stable operation of large-scale ML systems for model deployment, training, and inference. This role involves building and maintaining distributed ML infrastructure, optimizing resource utilization, and participating in global on-call support for critical systems.

About the Role

The ByteDance Large Model Team is dedicated to advancing large AI model technology and aims to be a world-class research team. The team focuses on NLP, CV, speech, and other areas, leveraging abundant data and computing resources to develop its general large model with multi-modal capabilities. The Machine Learning (ML) System sub-team combines systems engineering and machine learning to build and maintain massively distributed ML training and inference systems and services globally, providing high-performance, highly reliable, scalable systems for LLM/AIGC/AGI. In this role, you will contribute to building a large-scale heterogeneous system that integrates GPU, NPU, RDMA, and storage, deepen your expertise in coding, performance analysis, and distributed systems, and participate in decision-making within a global team.

Responsibilities

Ensure ML systems operate efficiently for large model deployment, training, evaluation, and inference.
Maintain the stability of offline tasks/services across multi-data center, multi-region, and multi-cloud environments.
Manage and plan computing and storage resources, including cost and budget.
Oversee global system disaster recovery, cluster machine governance, business service stability, resource utilization, and operation efficiency improvements.
Develop software tools, products, and systems for efficient monitoring and management of ML infrastructure and services.
Participate in the global team roster for system and business on-call support.

Minimum Qualifications

Bachelor's degree or above in computer science, computer engineering, or a related field.
Strong proficiency in at least one programming language such as Go, Python, or Shell in a Linux environment.
Strong hands-on experience with Kubernetes and containers, including more than one year of relevant operations and maintenance experience.

Preferred Qualifications

Experience in the operation and maintenance of large-scale ML distributed systems.
Experience in the operation and maintenance of GPU servers.
Strong logical analysis skills, with the ability to abstract and decompose business logic effectively.
Strong sense of responsibility, good learning ability and communication skills, self-motivation, and strong team spirit.
Good documentation habits, with the ability to write and update workflow and technical documentation as required.
