onsite
Site Reliability Engineer - ACI Worldwide
This Site Reliability Engineer role focuses on building and operating high‑traffic payment systems with Python, Go, Kubernetes, and AWS, and on applying SLOs, error budgets, and chaos engineering to keep those systems reliable and performant.
About the role
Key Responsibilities
- Design, develop, and deploy scalable, highly available services in Python and Go, leveraging Kubernetes and AWS infrastructure.
- Implement and maintain Service Level Objectives (SLOs), error budget policies, and actionable alerting to drive reliability metrics.
- Lead incident response, conduct post‑mortems, and drive continuous improvement through chaos testing and root‑cause analysis.
- Collaborate with product and engineering teams to embed reliability practices into the development lifecycle.
- Participate in on‑call rotations, ensuring 24/7 availability and rapid incident resolution.
Requirements
- 3+ years of experience in site reliability or DevOps roles, with strong coding skills in Python or Go.
- Hands‑on experience with Kubernetes, container orchestration, and AWS services (EC2, EKS, S3, CloudWatch).
- Proficiency in defining and monitoring SLOs, error budgets, and alerting frameworks.
- Experience with chaos engineering, incident management, and post‑mortem processes.
- Excellent communication skills and a collaborative mindset.
Skills
Python, Go, Kubernetes, AWS