Remote / Onsite
Sr Manager, AI Site Reliability Engineer - Core Enterprise Services - Charles Schwab
Site Reliability Engineer
Lead AI Site Reliability Engineering for core enterprise services, driving scalable, resilient infrastructure on AWS with Kubernetes, Docker, and Terraform, while applying Python for automation and monitoring.
About the role
Key Responsibilities
- Design, implement, and maintain highly available AI workloads on AWS using Kubernetes and Docker.
- Develop and manage CI/CD pipelines with Terraform, ensuring secure and repeatable deployments.
- Automate monitoring, alerting, and incident response using Python scripts and cloud-native tools.
- Collaborate with data science and product teams to optimize AI model performance and reliability.
- Lead capacity planning, cost optimization, and performance tuning for large-scale AI services.
Requirements
- 10+ years of experience in site reliability engineering with a focus on AI/ML workloads.
- Proficiency in AWS services (EKS, ECS, Lambda, CloudWatch) and Kubernetes cluster management.
- Strong scripting skills in Python and experience with Terraform or similar IaC tools.
- Deep understanding of containerization, CI/CD, and observability best practices.
- Excellent communication and leadership skills, with a track record of mentoring teams.
Skills
Kubernetes, Docker, AWS, Python, Terraform