remote
Site Reliability Engineer - Aalyria Careers
Drive reliability and scalability of cloud-native services using AWS, Kubernetes, and automation tools. Collaborate with development teams to design, deploy, and maintain highly available infrastructure, ensuring performance, security, and continuous improvement.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, highly available infrastructure on AWS using Terraform and CloudFormation.
- Manage Kubernetes clusters, optimizing resource utilization and using rolling updates to achieve zero-downtime deployments.
- Develop and maintain CI/CD pipelines with GitHub Actions, Jenkins, or Argo CD to automate build, test, and release processes.
- Implement robust monitoring, logging, and alerting using Prometheus, Grafana, ELK stack, and CloudWatch.
- Collaborate with development teams to troubleshoot production incidents, perform root cause analysis, and drive post‑mortem improvements.
- Enforce security best practices, including IAM policies, network segmentation, and vulnerability scanning.
Requirements
- 3+ years of experience in site reliability or DevOps roles.
- Proficient with AWS services (EC2, RDS, S3, VPC, IAM) and infrastructure-as-code tools.
- Hands‑on experience with Kubernetes, Docker, and container orchestration.
- Strong scripting skills in Python or Bash for automation.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
AWS, Kubernetes, Docker, Terraform, Python, CI/CD