onsite
Site Reliability Engineer - Manchester - NS West 1 - BAE Systems
The Senior Site Reliability Engineer is responsible for designing, deploying, and maintaining highly available cloud infrastructure on AWS, automating CI/CD pipelines, and ensuring robust monitoring and incident response across Kubernetes clusters.
About the role
Key Responsibilities
- Design, implement, and manage scalable Kubernetes clusters on AWS, ensuring high availability and performance.
- Develop and maintain CI/CD pipelines with GitHub Actions, Terraform, and Helm for automated application delivery.
- Implement comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack.
- Lead incident response, root cause analysis, and post‑mortem documentation to improve system reliability.
- Collaborate with development teams to enforce best practices in code quality, security, and infrastructure as code.
Requirements
- 5+ years of experience in site reliability or DevOps roles.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS preferred).
- Strong scripting skills in Python or Bash and experience with Terraform or CloudFormation.
- Hands‑on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD) and monitoring solutions.
- Excellent problem‑solving skills and a proactive approach to automation and reliability.
Skills
Kubernetes, Docker, CI/CD, AWS, Python