onsite
Site Reliability Engineer - FalconSmartIT
Lead the modernization of IT operations by implementing observability, automating toil, and ensuring scalable, reliable systems using SRE principles, cloud platforms, and scripting.
About the role
Key Responsibilities
- Design and implement observability solutions across cloud and on‑prem environments to provide end‑to‑end visibility.
- Automate repetitive operational tasks using Python, shell scripting, and IaC tools to reduce toil and improve reliability.
- Collaborate with product and engineering teams to embed SRE best practices into the development lifecycle.
- Monitor system health, analyze incidents, and drive post‑mortem investigations to prevent recurrence.
- Manage capacity planning, performance tuning, and cost optimization for AWS and hybrid infrastructures.
Requirements
- Proven experience as an SRE or DevOps engineer in a fast‑moving environment.
- Strong knowledge of observability stacks (Prometheus, Grafana, Loki, ELK) and alerting frameworks.
- Hands‑on scripting in Python and experience with CI/CD pipelines (GitHub Actions, Jenkins, ArgoCD).
- Experience with AWS services (EC2, ECS/EKS, CloudWatch) and IaC tools (Terraform, CloudFormation).
- Excellent problem‑solving and communication skills, and a passion for continuous improvement.