Remote, Onsite
Manager, AI Site Reliability Engineer & Operations - Core Enterprise Services - Charles Schwab
Site Reliability Engineer
Lead AI-driven Site Reliability Engineering initiatives that ensure the high availability, performance, and scalability of core enterprise services, using Python, Kubernetes, AWS, Terraform, and modern CI/CD practices.
About the role
Key Responsibilities
- Design, implement, and operate highly reliable AI‑enabled services supporting core enterprise platforms.
- Develop automation scripts and infrastructure‑as‑code using Python and Terraform to provision and manage cloud resources on AWS.
- Build and maintain Kubernetes clusters, with responsibility for their scaling, monitoring, and incident response.
- Establish CI/CD pipelines that integrate testing, security, and deployment for AI workloads.
- Collaborate with cross‑functional teams to troubleshoot production issues, perform root‑cause analysis, and drive continuous improvement.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps, preferably in AI/ML environments.
- Strong proficiency in Python for automation and scripting.
- Hands‑on experience with Kubernetes orchestration and AWS cloud services.
- Expertise in infrastructure‑as‑code tools such as Terraform.
- Demonstrated ability to design and maintain CI/CD pipelines and monitoring solutions.
Skills
Python, Kubernetes, AWS, Terraform, CI/CD