Remote
Senior Site Reliability Engineer - Cribl
Site Reliability Engineer
Senior Site Reliability Engineer responsible for designing, automating, and scaling highly available telemetry platforms using Kubernetes, AWS, and modern IaC tools while driving observability and performance improvements.
About the role
Key Responsibilities
- Design, implement, and operate scalable, fault‑tolerant services on Kubernetes and AWS to support high‑volume telemetry ingestion.
- Develop automation and infrastructure‑as‑code solutions using Terraform, Python, and Go to streamline provisioning and deployment pipelines.
- Build and maintain robust monitoring, alerting, and observability stacks with Prometheus, Grafana, and custom metrics.
- Collaborate with development, security, and product teams to define SLO and SLA targets and ensure reliability across the stack.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system resilience.
Requirements
- 5+ years of production SRE or DevOps experience in cloud environments, preferably AWS.
- Strong programming and scripting skills in Python and Go, with a solid Linux system administration background.
- Hands‑on experience managing Kubernetes clusters at scale and implementing IaC with Terraform.
- Proficiency in building observability solutions using Prometheus, Grafana, and related tooling.
- Demonstrated ability to troubleshoot complex distributed systems and drive automation for reliability.
Skills
python, go, kubernetes, aws, terraform, prometheus, linux