Remote
Site Reliability Specialist - Descartes Systems Group
Software Engineer
Lead the design, deployment, and operation of scalable, secure cloud infrastructure for logistics solutions, leveraging Kubernetes, Docker, AWS, Terraform, and Python to ensure high availability and performance.
About the role
Key Responsibilities
- Architect, deploy, and maintain highly available Kubernetes clusters on AWS, with the goal of zero downtime for critical logistics services.
- Implement infrastructure-as-code using Terraform, automating provisioning and configuration across multiple environments.
- Develop and maintain CI/CD pipelines with GitHub Actions and Jenkins, integrating automated testing, security scanning, and blue‑green deployments.
- Design and enforce observability strategies using Prometheus, Grafana, and the ELK stack to monitor performance, detect anomalies, and drive proactive incident response.
- Collaborate with development teams to optimize application performance, implement best practices for containerization, and enforce security hardening.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles, with a strong focus on cloud-native technologies.
- Proficiency in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, CI/CD tooling, and scripting in Python or Bash.
- Deep understanding of monitoring, logging, and alerting best practices.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
Kubernetes, Docker, AWS, Terraform, Python