Remote / Onsite
Specialist, AI Site Reliability Engineer - Core Enterprise Services - Charles Schwab
Site Reliability Engineer
Lead AI infrastructure reliability as a Site Reliability Engineer, ensuring high availability and performance of AI services on Kubernetes and AWS. Leverage Python, Terraform, and advanced monitoring to automate deployments and maintain robust, scalable systems.
About the role
Key Responsibilities
- Design, implement, and maintain highly available AI workloads on Kubernetes clusters in AWS.
- Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
- Develop and maintain Python-based monitoring, alerting, and incident response tooling.
- Collaborate with data science and ML teams to optimize model deployment pipelines and resource utilization.
- Implement security best practices, including IAM, network policies, and secrets management.
- Participate in on‑call rotations, root cause analysis, and post‑mortem documentation.
Requirements
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Strong proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, Python scripting, and CI/CD tools (GitHub Actions, Jenkins).
- Deep understanding of monitoring, logging, and alerting frameworks (Prometheus, Grafana, ELK).
- Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.
Skills
Kubernetes, Docker, AWS, Python, Terraform