onsite
Site Reliability Engineer - adaptyv
Join a fast‑growing biotech startup as a Site Reliability Engineer, building scalable, automated infrastructure that lets AI agents run biology experiments. You will use Kubernetes, Docker, AWS, Terraform, and observability tools to ensure the reliability and performance of the lab automation platform.
About the role
Key Responsibilities
- Design, deploy, and maintain highly available Kubernetes clusters that support AI‑driven biology experiment pipelines.
- Implement CI/CD pipelines using GitHub Actions, Terraform, and Helm to automate application and infrastructure updates.
- Monitor system health with Prometheus, Grafana, and custom alerts, ensuring 99.9% uptime for critical lab workflows.
- Collaborate with software, data science, and hardware teams to troubleshoot performance bottlenecks and optimize resource utilization.
- Drive incident response, post‑mortem analysis, and continuous improvement of reliability practices.
Requirements
- 3+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Hands‑on experience with Terraform, Helm, and CI/CD tooling.
- Strong scripting skills in Bash or Python for automation.
- Excellent problem‑solving skills and a proactive, collaborative mindset.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus