remote
Staff Engineer, Site Reliability - BABYLIST
Software Engineer
Lead the engineering of highly available, scalable infrastructure for a fast‑growing consumer platform, driving automation, observability, and reliability across AWS, Kubernetes, and cloud-native tooling.
About the role
Key Responsibilities
- Design, build, and maintain production‑grade infrastructure for a high‑traffic consumer platform using AWS, Kubernetes, and Terraform.
- Implement and evolve CI/CD pipelines that deliver rapid, reliable, zero‑downtime deployments.
- Develop and maintain the observability stack (metrics, logs, traces) to detect, diagnose, and remediate incidents proactively.
- Collaborate with cross‑functional teams to define SLOs, SLIs, and incident response processes.
- Mentor and guide junior engineers on best practices in reliability, automation, and cloud architecture.
Requirements
- 5+ years of experience in site reliability or DevOps roles at high‑scale SaaS or e‑commerce companies.
- Proficiency with AWS services (EC2, ECS/EKS, RDS, S3, CloudWatch) and Kubernetes cluster management.
- Strong scripting skills in Python or Go, and infrastructure-as-code experience with Terraform.
- Hands‑on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD) and monitoring/alerting platforms (Prometheus, Grafana, Datadog).
- Excellent problem‑solving, communication, and collaboration skills in a fast‑moving, remote environment.
Skills
kubernetes, aws, python, go, terraform, cicd