onsite
Senior Site Reliability Engineer - Charles Schwab
Site Reliability Engineer
The Senior Site Reliability Engineer drives reliability, scalability, and automation for cloud-native services built on Kubernetes, Docker, AWS, and Terraform. The role leads incident response, improves monitoring and observability, and shapes SRE best practices across production environments.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS using Terraform and Kubernetes.
- Develop and refine CI/CD pipelines, ensuring rapid, reliable deployments with automated testing and rollbacks.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve system resilience.
- Implement and enhance monitoring, alerting, and observability solutions (Prometheus, Grafana, ELK) to provide real‑time insights into system health.
- Collaborate with development teams to embed SRE principles into application design and code reviews.
Requirements
- 5+ years of experience in site reliability or DevOps roles, with a strong focus on cloud-native technologies.
- Proficiency in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, Helm, and CI/CD tools such as GitHub Actions or Jenkins.
- Deep understanding of monitoring, logging, and alerting best practices.
- Excellent communication skills and a proven ability to work collaboratively in cross‑functional teams.
Skills
Kubernetes, Docker, AWS, Terraform, CI/CD