Remote

Senior Site Reliability Engineer, Reliability Team - USDS - TikTok USDS JV

Site Reliability Engineer

Senior Site Reliability Engineer driving reliability, observability, and automation for a global, fault‑tolerant infrastructure using Python, Go, Kubernetes, AWS, and Prometheus.

About the role

Key Responsibilities

Own end‑to‑end reliability of production services, ensuring high availability and performance at global scale.
Design, implement, and maintain observability pipelines with Prometheus, Grafana, and custom telemetry collectors.
Automate deployment, scaling, and configuration using Kubernetes, Terraform, and CI/CD pipelines.
Collaborate with software, security, and operations teams to define SLAs, run post‑mortems, and drive continuous improvement.
Develop and maintain tooling in Python and Go to support monitoring, alerting, and incident response.

Requirements

5+ years of SRE or DevOps experience in large‑scale, distributed environments.
Proficiency with Kubernetes, AWS services (EC2, EKS, CloudWatch), and infrastructure as code.
Strong scripting skills in Python and/or Go, with experience building observability tools.
Deep understanding of monitoring, alerting, and incident management best practices.
Excellent communication and collaboration skills across cross‑functional teams.

Skills

Python, Go, Kubernetes, AWS, Prometheus

Company: TikTok USDS JV

Department: Engineering

Location: San Jose, CA, United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 20, 2026