remote
Senior Site Reliability Engineer, Platform - Bitdeer Technologies Group
Site Reliability Engineer
Senior SRE focused on building and operating highly available, automated platforms for AI and Bitcoin mining workloads using Kubernetes, Terraform, Go, Python and AWS.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, fault‑tolerant platform services on Kubernetes for AI and mining workloads.
- Automate infrastructure provisioning and configuration management using Terraform and IaC best practices.
- Develop and support internal tooling and services in Go and Python to improve reliability and operational efficiency.
- Implement robust CI/CD pipelines, monitoring, and alerting (Prometheus, Grafana) to ensure high availability and rapid incident response.
- Collaborate with cross‑functional teams to define SLO and SLA targets, conduct capacity planning, and drive performance optimizations.
Requirements
- 5+ years of experience in site reliability or platform engineering, preferably in high‑performance compute environments.
- Strong expertise with Kubernetes orchestration, Terraform, and cloud platforms (AWS).
- Proficiency in Go and Python for automation and service development.
- Hands‑on experience building CI/CD pipelines and implementing observability stacks (Prometheus, Grafana, logging).
- Solid understanding of networking, security, and distributed systems, with a track record of incident management and root‑cause analysis.
Skills
Kubernetes, Terraform, Go, Python, AWS, CI/CD, Prometheus