remote

Senior Site Reliability Engineer - Cribl

Site Reliability Engineer

Senior Site Reliability Engineer responsible for designing, automating, and scaling highly available telemetry platforms using Kubernetes, AWS, and modern IaC tools while driving observability and performance improvements.

About the role

Key Responsibilities

Design, implement, and operate scalable, fault‑tolerant services on Kubernetes and AWS to support high‑volume telemetry ingestion.
Develop automation and infrastructure‑as‑code solutions using Terraform, Python, and Go to streamline provisioning and deployment pipelines.
Build and maintain robust monitoring, alerting, and observability stacks with Prometheus, Grafana, and custom metrics.
Collaborate with development, security, and product teams to define SLO and SLA targets and ensure reliability across the stack.
Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system resilience.

Requirements

5+ years of production SRE or DevOps experience in cloud environments, preferably AWS.
Strong programming/scripting skills in Python and Go, with a solid Linux system administration background.
Hands‑on experience managing Kubernetes clusters at scale and implementing IaC with Terraform.
Proficiency in building observability solutions using Prometheus, Grafana, and related tooling.
Demonstrated ability to troubleshoot complex distributed systems and drive automation for reliability.

Skills

Python, Go, Kubernetes, AWS, Terraform, Prometheus, Linux

Company: Cribl

Department: Engineering

Location: United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 19, 2026