Remote
Junior Site Reliability Engineer - Fable
Site Reliability Engineer
About the role
As a Junior Site Reliability Engineer, you will ensure the reliability, performance, and scalability of Fable’s accessibility platform using Kubernetes, Docker, AWS, and modern monitoring tools. You will also work with development teams to implement CI/CD pipelines and automation scripts.
Key Responsibilities
- Maintain and improve the reliability and uptime of production services running on Kubernetes clusters in AWS.
- Implement and manage CI/CD pipelines, ensuring automated testing, deployment, and rollbacks.
- Configure and monitor the observability stack (Prometheus, Grafana, Loki) to detect and resolve performance bottlenecks.
- Automate infrastructure provisioning and configuration using infrastructure-as-code (IaC) tools and Python scripts.
- Collaborate with developers to troubleshoot incidents, conduct post‑mortems, and implement preventive measures.
Requirements
- 1–2 years of experience in site reliability or DevOps roles.
- Proficiency in scripting (Python or Bash) for automation tasks.
- Familiarity with monitoring and alerting tools such as Prometheus, Grafana, and Loki.
- Strong problem‑solving skills and a proactive approach to incident management.
Skills
Kubernetes, Docker, AWS, CI/CD, Python