onsite

Site Reliability Engineer - FalconSmartIT

Lead the modernization of IT operations by implementing observability, automating toil, and ensuring scalable, reliable systems using SRE principles, cloud platforms, and scripting.

About the role

Key Responsibilities

Design and implement observability solutions across cloud and on‑prem environments to provide end‑to‑end visibility.
Automate repetitive operational tasks using Python, shell scripting, and IaC tools to reduce toil and improve reliability.
Collaborate with product and engineering teams to embed SRE best practices into the development lifecycle.
Monitor system health, analyze incidents, and drive post‑mortem investigations to prevent recurrence.
Manage capacity planning, performance tuning, and cost optimization for AWS and hybrid infrastructures.

Requirements

Proven experience as an SRE or DevOps engineer in a fast‑moving environment.
Strong knowledge of observability stacks (Prometheus, Grafana, Loki, ELK) and alerting frameworks.
Hands‑on scripting in Python and experience with CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD).
Experience with AWS services (EC2, ECS/EKS, CloudWatch) and IaC tools (Terraform, CloudFormation).
Excellent problem‑solving skills, communication, and a passion for continuous improvement.

Skills

Python, AWS

Company: FalconSmartIT

Department: Engineering

Location: ENG, United Kingdom

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 24, 2026