onsite
Sr Manager, Site Reliability Engineering - FIS
Software Engineer
Lead a high‑performing SRE team to build and maintain a resilient, scalable payments platform using AWS, Kubernetes, and advanced monitoring, driving proactive reliability and performance improvements.
About the role
Key Responsibilities
- Lead and mentor a team of site reliability engineers to design, implement, and operate highly available payment processing services on AWS.
- Architect and maintain Kubernetes clusters, CI/CD pipelines, and infrastructure-as-code for rapid, reliable deployments.
- Define and enforce reliability SLAs, run post‑incident reviews, and drive continuous improvement of incident response processes.
- Collaborate with development, security, and product teams to embed reliability best practices into the software development lifecycle.
- Implement and evolve monitoring, alerting, and observability solutions (Prometheus, Grafana, etc.) to detect and remediate performance bottlenecks.
Requirements
- 10+ years of experience in large‑scale distributed systems, with at least 5 years in a senior SRE or DevOps leadership role.
- Deep expertise in AWS services (EC2, RDS, ECS/EKS, CloudWatch) and Kubernetes cluster management.
- Proven track record of building CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD) and automating infrastructure with Terraform or CloudFormation.
- Strong background in monitoring, alerting, and incident management using Prometheus, Grafana, and PagerDuty.
- Excellent communication skills and a collaborative mindset for working across engineering, product, and operations teams.