Remote
Lead Site Reliability Engineer, Edge & Cloud - Advertima Vision AG
Site Reliability Engineer
Lead the design and operation of resilient edge and cloud infrastructures, driving automation, performance, and reliability across distributed services using Kubernetes, Docker, and modern observability tools.
About the role
Key Responsibilities
- Architect, deploy, and maintain highly available edge and cloud services, ensuring 99.99% uptime and rapid incident response.
- Lead automation initiatives with CI/CD pipelines, infrastructure as code, and container orchestration (Kubernetes, Docker).
- Implement robust monitoring, logging, and alerting solutions to detect and remediate performance bottlenecks and outages.
- Collaborate with development, security, and product teams to embed reliability best practices into the software delivery lifecycle.
- Mentor and grow a high‑performing SRE team, fostering a culture of continuous improvement and knowledge sharing.
Requirements
- 5+ years of SRE or DevOps experience in large‑scale distributed systems.
- Deep expertise in Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
- Proficiency with CI/CD tools (GitHub Actions, GitLab CI, Jenkins) and IaC (Terraform, CloudFormation).
- Strong scripting skills (Python, Bash) and experience with monitoring stacks (Prometheus, Grafana, ELK).
- Excellent problem‑solving, communication, and leadership abilities.
Skills
Kubernetes, Docker, CI/CD