onsite
Site Reliability Engineer - NOV
Lead production reliability and incident response, driving performance tuning, automation, and operational excellence across a scalable platform to maximize uptime and user satisfaction.
About the role
Key Responsibilities
- Own operational excellence and incident management for production systems, ensuring high availability and low latency.
- Design, implement, and maintain monitoring, alerting, and observability pipelines to detect and resolve performance issues proactively.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
- Collaborate with software engineering teams to embed reliability best practices into the development lifecycle.
- Drive automation of deployment, scaling, and recovery processes to reduce manual toil and accelerate feature delivery.
Requirements
- Proven experience as a Site Reliability Engineer, or in a similar role, in a high‑scale production environment.
- Strong knowledge of monitoring tools (Prometheus, Grafana, Datadog, etc.) and incident response frameworks.
- Hands‑on expertise in performance tuning, capacity planning, and scalability engineering.
- Solid scripting skills (Python, Bash, or similar) for automation and tooling.
- Excellent communication and collaboration skills to work across engineering, operations, and product teams.
Skills
python, bash, aws, gcp, azure, kubernetes, docker, terraform