Remote
Staff Site Reliability Engineer - Visa
Site Reliability Engineer
Lead the design, implementation, and operation of highly available, scalable infrastructure on AWS using Terraform, Kubernetes, and advanced monitoring, driving reliability and performance for global payment services.
About the role
Key Responsibilities
- Architect and maintain large‑scale, highly available infrastructure on AWS, leveraging Terraform for IaC and Kubernetes for container orchestration.
- Design and implement robust CI/CD pipelines, ensuring rapid, reliable deployments across multiple environments.
- Develop and enforce SRE best practices, including SLIs, SLOs, error budgets, and automated incident response.
- Collaborate with cross‑functional teams to troubleshoot, root‑cause, and resolve production incidents, driving continuous improvement.
- Implement advanced monitoring, alerting, and observability solutions to proactively detect and mitigate performance issues.
Requirements
- 10+ years of experience in site reliability engineering or related roles, with a strong focus on cloud infrastructure.
- Deep expertise in AWS services (EC2, RDS, S3, VPC, CloudWatch) and Terraform for infrastructure provisioning.
- Hands‑on experience with Kubernetes, container runtimes, and related ecosystem tools.
- Proficiency in scripting (Python, Bash) and in automating deployment pipelines.
- Strong understanding of distributed systems principles, performance tuning, and incident management.
Skills
AWS, Terraform, Kubernetes, CI/CD