onsite

Senior Site Reliability Engineer (SRE) - 1global

The Senior Site Reliability Engineer designs, deploys, and maintains highly available cloud infrastructure using Kubernetes, Docker, and AWS. The role relies on Prometheus and Grafana for monitoring, Terraform for automation, and Python scripting to keep systems reliable and performant.

About the role

Key Responsibilities

Design, implement, and manage scalable Kubernetes clusters on AWS, ensuring high availability and fault tolerance.
Develop and maintain CI/CD pipelines using Terraform, GitHub Actions, and Docker to automate deployments and infrastructure changes.
Implement comprehensive monitoring and alerting with Prometheus, Grafana, and CloudWatch, and troubleshoot production incidents to minimize downtime.
Collaborate with development teams to enforce best practices for observability, logging, and security across services.
Conduct post‑mortem analyses, root cause investigations, and continuous improvement initiatives to enhance system reliability.

Requirements

5+ years of experience in site reliability engineering or DevOps roles.
Proficient with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Strong scripting skills in Python and experience with IaC tools like Terraform.
Hands‑on experience with Prometheus, Grafana, and the ELK stack for monitoring and logging.
Excellent problem‑solving skills and ability to work in a fast‑paced, collaborative environment.

Skills

Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, Python

Company: 1global

Department: Engineering

Location: Berlin, Germany

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 20, 2026