remote
Site Reliability Engineer - TikTok
Site Reliability Engineer responsible for ensuring stability, observability, and cost‑effective operation of TikTok's core services at global scale, leveraging automation, incident response, and cloud infrastructure tools.
About the role
Key Responsibilities
- Maintain and improve the reliability of core services, responding to production incidents with rapid root‑cause analysis and remediation.
- Design, implement, and operate monitoring, alerting, and observability platforms (e.g., Prometheus, Grafana) to provide real‑time insight into system health.
- Develop and maintain infrastructure-as-code using Terraform and automate deployment pipelines for scalable cloud environments on AWS.
- Collaborate with engineering teams to build self‑service tools and platforms that enhance incident handling efficiency and reduce manual toil.
- Continuously evaluate performance, capacity, and cost metrics to drive optimizations and ensure service availability 24/7.
Requirements
- 3+ years of experience in site reliability, DevOps, or production engineering roles.
- Strong proficiency in Python scripting and Linux system administration.
- Hands‑on experience with container orchestration (Kubernetes) and cloud platforms, preferably AWS.
- Solid understanding of infrastructure‑as‑code (Terraform) and CI/CD pipelines.
- Familiarity with monitoring and observability tools such as Prometheus and Grafana.
Skills
Python, Kubernetes, AWS, Terraform, Prometheus, Linux