remote
Senior Site Reliability Engineer - Hard Rock Digital
Site Reliability Engineer
The Senior Site Reliability Engineer is responsible for designing, deploying, and maintaining highly available, scalable gaming platforms on AWS using Kubernetes, Docker, and CI/CD pipelines. The role also covers monitoring with Prometheus and Grafana and managing infrastructure as code with Terraform.
About the role
Senior Site Reliability Engineer at Hard Rock Digital.
Key technologies: LLM, LangChain, Java, Kubernetes, Prometheus, Grafana.
Key Responsibilities
- Define and track SLOs, SLIs and error budgets
- Design and implement observability stacks (metrics, logging, tracing)
- Automate toil and improve system reliability through engineering
- Conduct post-mortems and drive blameless incident retrospectives
Requirements
- 5+ years of relevant experience in site reliability engineering
- Proficiency with monitoring tools (Prometheus, Grafana, Datadog)
- Strong programming skills for automation and tooling
Skills
Kubernetes, Docker, CI/CD, AWS, Prometheus, Grafana, Terraform