remote
Site Reliability Engineer (SRE) - Tech Next
This role is for a Senior Site Reliability Engineer with 8+ years of experience who will drive reliability, scalability, and automation across mission‑critical cloud platforms, using Kubernetes, AWS, and observability tooling to keep services performant and operations dependable.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS, meeting a 99.99% uptime target (roughly 53 minutes of downtime per year) for mission‑critical services.
- Develop and automate deployment pipelines, configuration management, and monitoring solutions using Kubernetes, Terraform, and CI/CD tools.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system reliability.
- Collaborate with engineering teams to embed observability, performance testing, and capacity planning into the development lifecycle.
- Drive platform engineering initiatives, standardizing best practices, tooling, and documentation across the organization.
Requirements
- 8+ years of SRE or DevOps experience in large‑scale production environments.
- Proficiency with AWS services, Kubernetes, Terraform, and CI/CD pipelines.
- Strong scripting skills in Python or Bash for automation and tooling.
- Deep understanding of monitoring, logging, and alerting platforms (Prometheus, Grafana, ELK).
- Excellent problem‑solving, communication, and collaboration skills.