Database SRE Manager (Remote, AUS) - CrowdStrike
Site Reliability Engineer
Lead a remote Database SRE team, driving reliability, automation, and performance for mission‑critical data services using Kubernetes, Prometheus, Grafana, AWS, Terraform, and Python. Own incident response, capacity planning, and continuous improvement of database infrastructure.
About the role
Key Responsibilities
- Lead and mentor a distributed team of Database SREs, ensuring high availability and performance of critical data services.
- Design, implement, and maintain Kubernetes‑based database clusters, leveraging Prometheus and Grafana for observability.
- Automate infrastructure provisioning and configuration with Terraform, AWS services, and Python scripts.
- Own incident response, root‑cause analysis, and post‑mortem processes to continuously improve reliability.
- Collaborate with DevOps, security, and product teams to define SLAs, capacity plans, and disaster‑recovery strategies.
Requirements
- 5+ years of experience in database operations or site reliability engineering (SRE).
- Proficiency with Kubernetes, Prometheus, Grafana, AWS, Terraform, and Python.
- Strong incident management and root‑cause analysis skills.
- Excellent communication and leadership abilities in a remote, distributed environment.
Skills
Kubernetes, Prometheus, Grafana, AWS, Terraform, Python