Remote
Senior Site Reliability Engineer, Reliability Team - USDS - TikTok USDS JV
Site Reliability Engineer
Senior Site Reliability Engineer driving reliability, observability, and automation for global, fault‑tolerant infrastructure, working with Python, Go, Kubernetes, AWS, and Prometheus.
About the role
Key Responsibilities
- Own end‑to‑end reliability of production services, ensuring high availability and performance at global scale.
- Design, implement, and maintain observability pipelines with Prometheus, Grafana, and custom telemetry collectors.
- Automate deployment, scaling, and configuration using Kubernetes, Terraform, and CI/CD pipelines.
- Collaborate with software, security, and operations teams to define SLAs, run post‑mortems, and drive continuous improvement.
- Develop and maintain tooling in Python and Go to support monitoring, alerting, and incident response.
Requirements
- 5+ years of SRE or DevOps experience in large‑scale, distributed environments.
- Proficiency with Kubernetes, AWS services (EC2, EKS, CloudWatch), and infrastructure as code.
- Strong scripting skills in Python and/or Go, with experience building observability tools.
- Deep understanding of monitoring, alerting, and incident management best practices.
- Excellent communication and collaboration skills across cross‑functional teams.
Skills
Python, Go, Kubernetes, AWS, Prometheus