Senior Site Reliability Engineer - Retail Mobility - T-Mobile
Site Reliability Engineer
Onsite
About the role
The Senior Site Reliability Engineer is responsible for building resilient digital infrastructure, automating deployments, and improving the reliability of retail mobility services using Kubernetes, Terraform, Python, CI/CD pipelines, and AWS.
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure for retail mobility applications.
- Automate provisioning, configuration, and deployment workflows using Terraform, Python scripts, and CI/CD pipelines.
- Monitor system health, set up alerting, and perform root‑cause analysis to reduce incident frequency and mean time to recovery.
- Collaborate with development teams to embed reliability best practices into the software development lifecycle.
- Continuously improve observability, capacity planning, and performance tuning across AWS cloud environments.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, with a strong focus on automation.
- Proficiency in Kubernetes orchestration and infrastructure‑as‑code tools such as Terraform.
- Solid programming/scripting skills in Python and experience building CI/CD pipelines (e.g., Jenkins, GitLab CI, or GitHub Actions).
- Hands‑on experience with AWS services (EC2, EKS, S3, CloudWatch) and monitoring solutions (Prometheus, Grafana, or similar).
- Strong problem‑solving abilities, excellent communication, and a track record of driving reliability improvements in production systems.
Skills
Kubernetes, Terraform, Python, CI/CD, AWS