remote
Senior Platform/Site Reliability Engineer - Lemon.io
Site Reliability Engineer
Senior Platform/Site Reliability Engineer responsible for designing, deploying, and maintaining highly available, scalable infrastructure with Kubernetes, Docker, AWS, Terraform, and CI/CD pipelines, and for ensuring robust monitoring and incident response in a remote, high‑growth startup environment.
About the role
Key Responsibilities
- Design, implement, and manage scalable, highly available Kubernetes clusters across AWS environments.
- Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
- Develop and maintain Docker images, Helm charts, and deployment scripts for microservices.
- Implement comprehensive monitoring, alerting, and logging solutions (Prometheus, Grafana, ELK).
- Lead incident response, root‑cause analysis, and post‑mortem documentation to improve reliability.
- Collaborate with development teams to enforce best practices for performance, security, and cost optimization.
Requirements
- 5+ years of experience in platform or site reliability engineering.
- Proficiency with Kubernetes, Docker, and cloud-native tooling.
- Hands‑on experience with AWS services (EKS, EC2, S3, RDS).
- Strong scripting skills in Bash, Python, or Go.
- Experience with Terraform, Helm, and CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
Skills
kubernetes, docker, aws, terraform, cicd