remote
Site Reliability Engineer, Platform Responsibility - USDS - TikTok USDS JV
Site Reliability Engineer focused on platform reliability: scaling ML-driven abuse detection systems on AWS with Kubernetes, Python, and observability tooling to maintain uptime and performance for billions of users.
About the role
Key Responsibilities
- Design, deploy, and maintain highly available Kubernetes clusters that run machine learning inference pipelines for abuse detection.
- Implement CI/CD pipelines and automated testing to accelerate feature delivery while ensuring reliability.
- Develop and maintain the observability stack (metrics, logs, traces) to detect, diagnose, and remediate incidents in real time.
- Collaborate with data science and security teams to optimize model performance and reduce false positives.
- Automate scaling, patching, and disaster recovery processes across AWS infrastructure.
Requirements
- 5+ years of SRE or DevOps experience in a large-scale, data-intensive environment.
- Proficiency with Kubernetes, Helm, and container orchestration best practices.
- Strong scripting skills in Python and experience with CI/CD tools such as GitHub Actions or Jenkins.
- Hands-on experience with AWS services (EKS, EC2, S3, CloudWatch, Lambda).
- Deep understanding of observability, monitoring, and incident response.
Skills
Kubernetes, AWS, Python, machine learning, CI/CD