Remote
Observability Engineer - Releady
Software Engineer
Junior‑to‑Mid Observability Engineer with 3+ years in SRE/DevOps, driving monitoring, automation, and reliability for large‑scale cloud‑native platforms using Prometheus, Grafana, and Kubernetes.
About the role
Key Responsibilities
- Design, implement, and maintain observability solutions across cloud‑native infrastructure, ensuring comprehensive visibility into application and platform performance.
- Collaborate with product and infrastructure teams to define monitoring requirements, develop dashboards, and set up alerting rules that drive proactive incident response.
- Automate observability workflows using scripting and configuration management tools, reducing manual effort and improving reliability.
- Analyze metrics, logs, and traces to root‑cause incidents, provide post‑mortem insights, and recommend capacity and performance improvements.
- Stay current with emerging observability tools and best practices, evaluating new technologies for potential adoption.
Requirements
- 3+ years of experience in Site Reliability Engineering, Platform Operations, or DevOps roles.
- Hands‑on expertise with Prometheus, Grafana, and Kubernetes monitoring stacks.
- Strong scripting skills (Python, Bash) and familiarity with CI/CD pipelines.
- Excellent problem‑solving abilities and a proactive approach to incident management.
- Effective communication skills for collaborating with cross‑functional teams.
Skills
Prometheus, Grafana, Kubernetes