Remote, Onsite
Sr. Specialist, AI Site Reliability Engineer - Core Enterprise Services - Charles Schwab
Site Reliability Engineer
Senior AI Site Reliability Engineer responsible for designing, deploying, and maintaining AI-driven services on AWS, leveraging Kubernetes and Python to ensure high availability, performance, and scalability for core enterprise applications.
About the role
Key Responsibilities
- Design, implement, and manage AI/ML workloads on AWS, ensuring high availability and fault tolerance.
- Build and maintain Kubernetes clusters and automate deployment pipelines with CI/CD tools.
- Monitor system performance, troubleshoot incidents, and implement proactive reliability improvements.
- Collaborate with data science and software teams to integrate ML models into production workflows.
- Develop observability solutions using Prometheus, Grafana, and CloudWatch.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Strong proficiency in Python and container orchestration with Kubernetes.
- Hands‑on experience with AWS services (EKS, ECS, S3, Lambda, CloudWatch).
- Knowledge of MLOps practices and model deployment pipelines.
- Excellent problem‑solving skills and a collaborative mindset.
Skills
Python, Kubernetes, AWS