Onsite
Senior Site Reliability Engineer - NAB - National Australia Bank
Site Reliability Engineer
Lead the design, deployment, and operation of highly available, scalable cloud services using Kubernetes, Docker, and AWS, while implementing robust monitoring, alerting, and automation to ensure optimal performance and reliability.
About the role
Key Responsibilities
- Architect, deploy, and maintain production-grade Kubernetes clusters and containerized workloads across AWS environments.
- Design and implement CI/CD pipelines, infrastructure as code (Terraform), and automated rollouts to accelerate feature delivery.
- Develop and maintain comprehensive monitoring, alerting, and observability solutions using Prometheus, Grafana, and related tooling.
- Collaborate with development, security, and product teams to define reliability SLAs, SLOs, and incident response procedures.
- Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce MTTR and prevent recurrence.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles, with a strong focus on cloud-native technologies.
- Proficiency in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, CI/CD tools (GitHub Actions, Jenkins, Argo CD), and scripting (Bash, Python).
- Deep understanding of monitoring, logging, and alerting best practices using Prometheus, Grafana, ELK, or similar stacks.
- Excellent problem‑solving skills, strong communication, and a proactive, customer‑centric mindset.
Skills
Kubernetes, Docker, CI/CD, AWS, Prometheus, Grafana, Terraform