onsite
Senior Site Reliability Engineer I - American Express
Site Reliability Engineer
Senior Site Reliability Engineer responsible for designing and operating highly available, observable systems, automating deployments, and driving reliability best practices using cloud, container, and monitoring technologies.
About the role
Key Responsibilities
- Design, implement, and maintain scalable SRE solutions on AWS, including infrastructure as code with Terraform.
- Develop and manage container platforms built on Docker and orchestrated with Kubernetes to ensure high availability and performance.
- Build and enhance real‑time observability pipelines using Prometheus, Grafana, and custom metrics.
- Automate deployment, configuration, and incident response workflows with CI/CD pipelines and Python scripting.
- Collaborate with development, security, and product teams to embed reliability and automation into the software lifecycle.
Requirements
- 5+ years of experience in site reliability, DevOps, or systems engineering.
- Strong expertise with AWS services, Kubernetes, Docker, and Terraform.
- Proficiency in scripting/automation using Python and CI/CD tools (Jenkins, GitHub Actions, etc.).
- Hands‑on experience with monitoring and alerting stacks such as Prometheus and Grafana.
- Solid understanding of networking, Linux systems, and incident management processes.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus, Python, CI/CD