Remote
SRE - AWS/Azure - Tech Next
Site Reliability Engineer
We are looking for a Senior Site Reliability Engineer with 8+ years of experience driving reliability, scalability, and automation across AWS and Azure environments. The role focuses on observability, incident response, and platform engineering for mission‑critical production systems.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS and Azure for mission‑critical services.
- Develop and automate deployment pipelines, configuration management, and monitoring solutions using infrastructure-as-code (IaC) tools.
- Lead incident response, root cause analysis, and post‑mortem processes to continuously improve reliability.
- Collaborate with engineering teams to embed observability, performance tuning, and capacity planning into the development lifecycle.
- Drive platform engineering initiatives, including service mesh, container orchestration, and security hardening.
Requirements
- 8+ years of SRE or DevOps experience with deep expertise in AWS and Azure.
- Proficiency with automation tools (Terraform, Ansible, Pulumi) and CI/CD pipelines.
- Strong background in observability (Prometheus, Grafana, ELK, CloudWatch) and incident management.
- Experience with container orchestration (Kubernetes, ECS, AKS) and service mesh technologies.
- Excellent problem‑solving skills, strong communication, and a proactive, collaborative mindset.