Remote
Site Reliability Engineer - By Light Professional IT Services
Site Reliability Engineer responsible for designing, deploying, and maintaining highly available cloud infrastructure using Kubernetes, Docker, AWS, and Terraform, ensuring performance, reliability, and security for mission‑critical applications.
About the role
Key Responsibilities
- Design, implement, and manage scalable Kubernetes clusters on AWS, ensuring high availability and fault tolerance.
- Automate infrastructure provisioning and configuration using Terraform, maintaining version control and reproducibility.
- Develop and maintain CI/CD pipelines that cover application deployment, monitoring, and rollback.
- Implement robust monitoring, logging, and alerting solutions (Prometheus, Grafana, CloudWatch) to detect and resolve incidents proactively.
- Collaborate with development teams to optimize application performance, security, and cost efficiency.
Requirements
- 3+ years of experience in site reliability or DevOps roles with a focus on cloud-native technologies.
- Hands‑on expertise with Kubernetes, Docker, and AWS services (EKS, EC2, S3, RDS).
- Proficiency in infrastructure-as-code using Terraform or similar tools.
- Strong scripting skills (Python, Bash) and experience with CI/CD tools (Jenkins, GitHub Actions).
- Excellent problem‑solving abilities and a proactive approach to incident management.
Skills
Kubernetes, Docker, AWS, Terraform