onsite
Site Reliability Engineer - Production Support - Charles Schwab
The Site Reliability Engineer in this role focuses on production support, ensuring high availability of cloud‑native services using AWS, Python, and Bash. Responsibilities include monitoring, incident response, and continuous improvement of reliability practices.
About the role
Key Responsibilities
- Maintain and enhance the reliability of production services hosted on AWS, ensuring 99.9% uptime.
- Implement and manage monitoring solutions with Prometheus and Grafana, creating dashboards and alerting rules.
- Lead incident response, perform root‑cause analysis, and drive post‑mortem documentation.
- Automate deployment pipelines and configuration management using Python scripts and CI/CD tools.
- Collaborate with development teams to embed reliability best practices into the software development lifecycle.
Requirements
- 3+ years of experience in site reliability or production support roles.
- Strong proficiency with AWS services (EC2, RDS, S3, CloudWatch).
- Hands‑on scripting skills in Python and Bash.
- Experience with monitoring/alerting tools such as Prometheus, Grafana, or similar.
- Excellent problem‑solving skills and ability to work under pressure during incidents.
Skills
AWS, Python, Bash, Prometheus, Grafana