Remote
Platform Operations Engineer (Site Reliability Engineer) - Vertiv
Site Reliability Engineer
Platform Operations Engineer (SRE) driving cross‑platform observability, monitoring, and incident response across a diverse digital ecosystem using Python, Node.js, Kubernetes, Prometheus, and Grafana.
About the role
Key Responsibilities
- Design, implement, and maintain end‑to‑end monitoring and alerting pipelines for a multi‑tool digital platform stack.
- Own incident response workflows, root‑cause analysis, and post‑mortem documentation to improve reliability.
- Collaborate with development and product teams to define SLAs, SLOs, and reliability metrics.
- Automate operational tasks and configuration management using Python and Node.js scripts.
- Integrate observability solutions with enterprise tools such as Compass AI, Writer AI, SiteScope, UiPath, Workato, and Cursor.
Requirements
- 3+ years of SRE or platform operations experience in a cloud‑native environment.
- Hands‑on experience with incident management platforms and post‑mortem processes.
- Excellent communication skills and a collaborative mindset.
Skills
Python, Node.js, Kubernetes, Prometheus, Grafana