onsite

Site Reliability Engineer - Artificial Labs

Lead the design, deployment, and maintenance of scalable, highly available services on Kubernetes and AWS, using Docker, Prometheus, Grafana, Terraform, and CI/CD pipelines to ensure reliability, performance, and rapid incident response.

About the role

Key Responsibilities

Design, implement, and maintain production-grade Kubernetes clusters and Docker-based microservices across AWS environments.
Build and manage an observability stack with Prometheus, Grafana, and alerting to detect incidents early and resolve them quickly.
Automate infrastructure provisioning and configuration using Terraform, ensuring repeatable, version-controlled deployments.
Collaborate with development teams to embed SRE best practices into CI/CD pipelines and release processes.
Lead root-cause analysis, post-mortem documentation, and continuous improvement initiatives to reduce MTTR and improve system resilience.

Requirements

5+ years of experience in site reliability engineering or DevOps roles.
Strong proficiency with Kubernetes, Docker, and cloud-native tooling.
Hands-on experience with Prometheus, Grafana, and alerting systems.
Expertise in AWS services (EKS, EC2, S3, CloudWatch) and IaC with Terraform.
Solid scripting skills (Python, Bash) and familiarity with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).

Skills

kubernetes, docker, prometheus, grafana, aws, terraform, cicd

Company: Artificial Labs

Department: Engineering

Location: London, ENG, United Kingdom

Experience: 3+ years

Tenure: full-time

Level: Mid-Level

Posted June 19, 2026