remote

Site Reliability Engineer II - American Express


The Site Reliability Engineer II is responsible for designing, deploying, and maintaining highly available services on Kubernetes and AWS, automating infrastructure with Terraform, and ensuring 24/7 uptime through monitoring, alerting, and incident response.

About the role

Key Responsibilities

Design, implement, and operate scalable, highly available services on Kubernetes clusters in AWS.
Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
Implement and maintain monitoring, alerting, and logging solutions with Prometheus, Grafana, and the ELK stack.
Lead incident response, root‑cause analysis, and post‑mortem documentation to improve reliability.
Collaborate with development teams to embed SRE best practices into the software development lifecycle.

Requirements

3+ years of experience in site reliability engineering or DevOps roles.
Proficient with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Hands‑on experience with Terraform and CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
Strong scripting skills in Bash or Python for automation.
Excellent problem‑solving skills and a proactive approach to improving system reliability.

Skills

Kubernetes, Docker, AWS, Terraform

Company: American Express

Department: Engineering

Location: Sunrise, FL, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 19, 2026