remote

Observability Engineer - Releady

Software Engineer

Junior‑to‑Mid Observability Engineer role for candidates with 3+ years in SRE/DevOps, driving monitoring, automation, and reliability for large‑scale cloud‑native platforms using Prometheus, Grafana, and Kubernetes.

About the role

Key Responsibilities

Design, implement, and maintain observability solutions across cloud‑native infrastructure, ensuring comprehensive visibility into application and platform performance.
Collaborate with product and infrastructure teams to define monitoring requirements, develop dashboards, and set up alerting rules that drive proactive incident response.
Automate observability workflows using scripting and configuration management tools, reducing manual effort and improving reliability.
Analyze metrics, logs, and traces to root‑cause incidents, provide post‑mortem insights, and recommend capacity and performance improvements.
Stay current with emerging observability tools and best practices, evaluating new technologies for potential adoption.

Requirements

3+ years of experience in Site Reliability Engineering, Platform Operations, or DevOps roles.
Hands‑on expertise with Prometheus, Grafana, and Kubernetes monitoring stacks.
Strong scripting skills (Python, Bash) and familiarity with CI/CD pipelines.
Excellent problem‑solving abilities and a proactive approach to incident management.
Effective communication skills for collaborating with cross‑functional teams.

Skills

Prometheus, Grafana, Kubernetes

Company: Releady

Department: Engineering

Location: Toronto, CA, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 20, 2026