remote

Site Reliability Engineer, Platform Responsibility - USDS - TikTok USDS JV

This Site Reliability Engineer role focuses on platform reliability: scaling ML-driven abuse detection systems on AWS with Kubernetes, Python, and observability tooling to maintain uptime and performance for billions of users.

About the role

Key Responsibilities

Design, deploy, and maintain highly available Kubernetes clusters that run machine learning inference pipelines for abuse detection.
Implement CI/CD pipelines and automated testing to accelerate feature delivery while ensuring reliability.
Develop and maintain the observability stack (metrics, logs, traces) to detect, diagnose, and remediate incidents in real time.
Collaborate with data science and security teams to optimize model performance and reduce false positives.
Automate scaling, patching, and disaster recovery processes across AWS infrastructure.

Requirements

5+ years of SRE or DevOps experience in a large-scale, data‑intensive environment.
Proficiency with Kubernetes, Helm, and container orchestration best practices.
Strong scripting skills in Python and experience with CI/CD tools such as GitHub Actions or Jenkins.
Hands‑on experience with AWS services (EKS, EC2, S3, CloudWatch, Lambda).
Deep understanding of observability, monitoring, and incident response.

Skills

Kubernetes, AWS, Python, machine learning, CI/CD

Company: TikTok USDS JV

Department: Engineering

Location: Seattle, WA, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026