onsite

Site Reliability Engineer - NOV

Lead production reliability and incident response, driving performance tuning, automation, and operational excellence across a scalable platform to maximize uptime and user satisfaction.

About the role

Key Responsibilities

Own operational excellence and incident management for production systems, ensuring high availability and low latency.
Design, implement, and maintain monitoring, alerting, and observability pipelines to detect and resolve performance issues proactively.
Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve reliability.
Collaborate with software engineering teams to embed reliability best practices into the development lifecycle.
Drive automation of deployment, scaling, and recovery processes to reduce manual toil and accelerate feature delivery.

Requirements

Proven experience as a Site Reliability Engineer or in a similar role in a high‑scale production environment.
Strong knowledge of monitoring tools (Prometheus, Grafana, Datadog, etc.) and incident response frameworks.
Hands‑on expertise in performance tuning, capacity planning, and scalability engineering.
Solid scripting skills (Python, Bash, or similar) for automation and tooling.
Excellent communication and collaboration skills to work across engineering, operations, and product teams.

Skills

Python, Bash, AWS, GCP, Azure, Kubernetes, Docker, Terraform

Company: NOV

Department: Engineering

Location: Kerala, India

Experience: 9+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026