Cloud Infrastructure Engineer Open LMS Colombia, Remote

About the role

Role Description

We are looking for a Senior Cloud Infrastructure Engineer to join our team and help build, scale, and evolve our multi-tenant SaaS hosting platform on AWS. Our platform dynamically provisions, manages, and scales hundreds of Moodle LMS instances for education clients — powered by custom orchestration tooling, distributed service discovery, and infrastructure as code.

This is a hands-on infrastructure role. You'll work across the full stack — from Terraform modules and Puppet manifests to Python automation and observability pipelines. The platform is not containerised — there is no Kubernetes here — so we're looking for someone who understands Linux systems deeply and can reason about distributed systems problems from first principles.

You'll have real ownership and influence over the platform's architecture and direction as we continue to grow and evolve the infrastructure.

What You'll Be Doing

Designing, building, and maintaining AWS infrastructure using Terraform (EC2, RDS, S3, SQS, Lambda, ALB, ElastiCache, Route 53, VPC networking)
Writing and maintaining Puppet modules to configure and manage fleets of EC2 instances across multiple auto-scaling groups
Maintaining and extending Python-based automation and tooling that supports platform operations
Operating and improving distributed service discovery and configuration management (etcd)
Managing and tuning a multi-tier caching strategy (Varnish, Redis/Valkey, PHP OPcache)
Running and scaling our observability stack (Prometheus, Grafana, Loki, Fluentd, PagerDuty) and participating in on-call rotations
Evaluating and implementing distributed storage solutions as the platform evolves
Improving deployment workflows and release processes
Collaborating with internal teams on API contracts, integration patterns, and operational tooling
Participating in incident response, root cause analysis, and platform reliability improvements

Skills and Aptitudes

Strong experience with AWS services in production — particularly EC2, RDS, S3, SQS, Lambda, ALB, ElastiCache, Route 53, IAM, and VPC networking
Proficiency in authoring and maintaining Terraform modules for production infrastructure
Proficiency in authoring and maintaining Puppet modules (or equivalent agent-based configuration management) for fleet management
Solid Python skills — you'll be writing and maintaining production daemons, not just scripts
Deep Linux systems knowledge (Ubuntu) — comfortable with Apache/Nginx, PHP-FPM, Varnish, systemd, filesystem mounts, and networking fundamentals
Understanding of distributed systems concepts: consensus, leader election, distributed locking, eventual consistency, and the tradeoffs involved
Proficiency in building and maintaining observability pipelines (Prometheus,
