remote

Site Reliability Engineer - TikTok

The Site Reliability Engineer is responsible for the stability, observability, and cost‑effective operation of TikTok's core services at global scale, using automation, incident response, and cloud infrastructure tools.

About the role

Key Responsibilities

Maintain and improve the reliability of core services, responding to production incidents with rapid root‑cause analysis and remediation.
Design, implement, and operate monitoring, alerting, and observability platforms (e.g., Prometheus, Grafana) to provide real‑time insight into system health.
Develop and maintain infrastructure-as-code using Terraform and automate deployment pipelines for scalable cloud environments on AWS.
Collaborate with engineering teams to build self‑service tools and platforms that enhance incident handling efficiency and reduce manual toil.
Continuously evaluate performance, capacity, and cost metrics to drive optimizations and ensure service availability 24/7.

Requirements

3+ years of experience in site reliability, DevOps, or production engineering roles.
Strong proficiency in Python scripting and Linux system administration.
Hands‑on experience with container orchestration (Kubernetes) and cloud platforms, preferably AWS.
Solid understanding of infrastructure‑as‑code (Terraform) and CI/CD pipelines.
Familiarity with monitoring and observability tools such as Prometheus and Grafana.

Skills

Python, Kubernetes, AWS, Terraform, Prometheus, Linux

Company: TikTok

Department: Engineering

Location: San Jose, California, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 26, 2026