Remote
Senior Manager, Site Reliability Engineering - Brinker International
Software Engineer
Lead a high‑performing Site Reliability Engineering team, driving reliability, automation, and scalability for critical restaurant support platforms using Kubernetes, AWS, Terraform, and modern CI/CD practices.
About the role
Key Responsibilities
- Lead, mentor, and grow a team of site reliability engineers to deliver highly available, performant services for restaurant operations.
- Design and implement cloud‑native architectures on AWS, leveraging Kubernetes, Terraform, and serverless components.
- Develop and maintain CI/CD pipelines, automated testing, and release processes to accelerate safe deployments.
- Establish observability standards using Prometheus, Grafana, and logging solutions; drive proactive monitoring and alerting.
- Own incident response, root‑cause analysis, and post‑mortem processes to continuously improve system reliability.
- Collaborate with product, security, and infrastructure teams to embed reliability and scalability into the development lifecycle.
Requirements
- 5+ years of hands‑on SRE or DevOps experience, with at least 2 years in a people‑management role.
- Deep expertise in Kubernetes orchestration, AWS services, and infrastructure‑as‑code (Terraform or CloudFormation).
- Proficiency in a scripting or programming language, such as Python, for automation and tooling.
- Strong background in CI/CD tooling (Jenkins, GitLab CI, GitHub Actions) and automated testing frameworks.
- Demonstrated ability to lead incident management, perform root‑cause analysis, and drive continuous improvement.
Skills
Kubernetes, AWS, Terraform, Python, CI/CD