Remote / Onsite

Specialist, AI Site Reliability Engineer - Core Enterprise Services - Charles Schwab

Site Reliability Engineer

Lead AI infrastructure reliability as a Site Reliability Engineer, ensuring high availability and performance of AI services on Kubernetes and AWS. Leverage Python, Terraform, and advanced monitoring to automate deployments and maintain robust, scalable systems.

About the role

Key Responsibilities

Design, implement, and maintain highly available AI workloads on Kubernetes clusters in AWS.
Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
Develop and maintain Python-based monitoring, alerting, and incident response tooling.
Collaborate with data science and ML teams to optimize model deployment pipelines and resource utilization.
Implement security best practices, including IAM, network policies, and secrets management.
Participate in on‑call rotations, root cause analysis, and post‑mortem documentation.

Requirements

5+ years of experience in Site Reliability Engineering or DevOps roles.
Strong proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Hands‑on experience with Terraform, Python scripting, and CI/CD tools (GitHub Actions, Jenkins).
Deep understanding of monitoring, logging, and alerting frameworks (Prometheus, Grafana, ELK).
Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.

Skills

Kubernetes, Docker, AWS, Python, Terraform

Company: Charles Schwab

Department: Engineering

Location: Telangana, India

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026