Remote
Site Reliability Engineer II - American Express
Site Reliability Engineer
The Site Reliability Engineer II is responsible for designing, deploying, and maintaining highly available services on Kubernetes and AWS, automating infrastructure with Terraform, and ensuring 24/7 uptime through monitoring, alerting, and incident response.
About the role
Key Responsibilities
- Design, implement, and operate scalable, highly available services on Kubernetes clusters in AWS.
- Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
- Implement and maintain monitoring, alerting, and logging solutions with Prometheus, Grafana, and the ELK stack.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to improve reliability.
- Collaborate with development teams to embed SRE best practices into the software development lifecycle.
Requirements
- 3+ years of experience in site reliability engineering or DevOps roles.
- Proficient with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform and CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Strong scripting skills in Bash or Python for automation.
- Excellent problem‑solving skills and a proactive approach to improving system reliability.
Skills
Kubernetes, Docker, AWS, Terraform