remote

Site Reliability Engineer - NICE

We're looking for a Site Reliability Engineer focused on designing and building scalable technical solutions. This mid-level role requires 3+ years of relevant experience.

About the role

At NiCE, we don’t limit our challenges. We challenge our limits. Always. We’re ambitious. We’re game changers. And we play to win. We set the highest standards and execute beyond them. And if you’re like us, we can offer you the ultimate career opportunity that will light a fire within you.

So, what’s the role all about?

The SRE – NOC role sits at the intersection of traditional Network Operations Center (NOC) responsibilities and engineering‑driven reliability practices. This role focuses on 24/7 service reliability, incident response, operational automation, and observability, while actively reducing operational toil through software and automation.

Unlike a traditional NOC analyst, an SRE‑NOC is expected to engineer problems away, not just respond to alerts.

How will you make an impact?

Incident Response & Operations

Act as a primary or escalation responder in a 24x7 on‑call rotation
Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
Coordinate across Engineering, Infrastructure, Security, and Product teams
Execute and improve runbooks, playbooks, and escalation paths
Drive blameless post‑incident reviews (PIRs) and track corrective actions

Monitoring, Alerting & Observability

Own service health monitoring across infrastructure, applications, and dependencies
Design and maintain alerting strategies that align with SLIs/SLOs
Reduce alert fatigue through signal‑to‑noise improvements
Build dashboards using tools such as Grafana, Prometheus, and Datadog / Splunk / CloudWatch

Reliability Engineering & Automation

Automate repetitive operational tasks to reduce manual toil
Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
Implement self‑healing and auto‑remediation where possible
Partner with engineering teams to improve system design for reliability

Platform & Infrastructure Support

Support and troubleshoot Linux‑based systems, cloud platforms (AWS, Azure, GCP), and Kubernetes / containerized environments
Assist with capacity planning and availability reviews
Ensure operational readiness for production releases

Have you got what it takes?

Technical

Strong Linux systems administration
Experience with incident management and production support
Familiarity with cloud infrastructure (AWS preferred), containers and orchestration (Docker, Kubernetes), and monitoring/alerting platforms
Scripting or programming experience in Python, Bash, Go, or similar
Under
