onsite
Site Reliability Engineer - Manchester - BAE Systems
The Senior Site Reliability Engineer is responsible for designing, deploying, and maintaining highly available cloud-native services using Kubernetes, Docker, and AWS, while ensuring robust monitoring, alerting, and automation across production environments.
About the role
Key Responsibilities
- Design, implement, and manage scalable Kubernetes clusters and Docker-based microservices across AWS environments.
- Develop and maintain CI/CD pipelines using Git, Jenkins, and Terraform to automate deployments and infrastructure provisioning.
- Implement comprehensive monitoring, logging, and alerting with Prometheus, Grafana, and the ELK stack to ensure high availability and performance.
- Collaborate with development teams to enforce best practices for code quality, security, and observability.
- Conduct root cause analysis, post‑incident reviews, and continuous improvement initiatives to reduce MTTR and prevent recurrence.
Requirements
- 5+ years of experience in site reliability or DevOps roles within cloud-native environments.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience building CI/CD pipelines and infrastructure as code (IaC) with Terraform or CloudFormation.
- Strong scripting skills in Python or Bash and familiarity with monitoring tools such as Prometheus and Grafana.
- Excellent problem‑solving abilities, strong communication skills, and a proactive, collaborative mindset.
Skills
Kubernetes, Docker, CI/CD, AWS, Prometheus, Grafana, Terraform