onsite
Site Reliability Engineer - Cloud Solutions International Pvt Ltd
This Site Reliability Engineer role focuses on the high availability and performance of cloud services, using Kubernetes, Docker, and monitoring tools such as Prometheus and Grafana. The work includes automating infrastructure with Terraform, scripting in Python, resolving incidents, and improving operational efficiency.
About the role
Site Reliability Engineer at Cloud Solutions International Pvt Ltd.
Key technologies: Kubernetes, Prometheus, Grafana.
Key Responsibilities
- Define and track SLOs, SLIs and error budgets
- Design and implement observability stacks (metrics, logging, tracing)
- Automate toil and improve system reliability through engineering
- Conduct post-mortems and drive blameless incident retrospectives
Requirements
- 3+ years of relevant experience in site reliability engineering
- Proficiency with monitoring tools (Prometheus, Grafana, Datadog)
- Strong programming skills for automation and tooling
Skills
Kubernetes, Docker, Prometheus, Grafana, AWS, Terraform, Python