Remote
Staff/Senior DevOps Consultant - Observability - 10pearls
Software Engineer
Lead the design and implementation of end‑to‑end observability solutions for large‑scale cloud platforms, driving performance, reliability, and automation across Kubernetes, CI/CD pipelines, and AWS environments.
About the role
Key Responsibilities
- Architect and deploy comprehensive observability stacks (Prometheus, Grafana, Loki, Tempo) across multi‑cluster Kubernetes environments.
- Design and automate monitoring, alerting, and incident response workflows using IaC (Terraform, CloudFormation) and CI/CD pipelines.
- Collaborate with engineering, security, and product teams to define SLOs, SLIs, and dashboards that support data‑driven decision‑making.
- Lead troubleshooting of production incidents, root cause analysis, and post‑mortem documentation.
- Mentor and coach junior DevOps engineers on best practices in observability, automation, and cloud operations.
Requirements
- 10+ years of experience in DevOps or Site Reliability Engineering roles.
- Deep expertise in Kubernetes, Prometheus, Grafana, and related observability tools.
- Proficient with AWS services (EKS, CloudWatch, Lambda) and IaC tools (Terraform, CloudFormation).
- Strong scripting skills in Python or Go and experience with CI/CD tools (GitHub Actions, Argo CD, Jenkins).
- Excellent communication skills and a proven track record of leading technical initiatives.
Skills
Kubernetes, Prometheus, Grafana, CI/CD, AWS, Terraform