remote
Site Reliability Engineer - STN Inc
Senior Site Reliability Engineer responsible for defining SLOs, building observability stacks, and leading incident response for a GPU‑as‑a‑Service platform using Kubernetes, Prometheus, Grafana, Docker, and AWS.
About the role
Key Responsibilities
- Define and enforce Service Level Objectives (SLOs) aligned with customer SLAs and contractual commitments.
- Build, maintain, and evolve the observability stack, including metrics collection, alerting, and dashboards with Prometheus, Grafana, and related tooling.
- Lead major incident investigations, root‑cause analysis, and post‑mortem documentation to drive continuous reliability improvements.
- Collaborate with platform, security, and development teams to implement automation, capacity planning, and performance tuning.
- Participate in on‑call rotations, ensuring rapid response and resolution of production incidents.
Requirements
- 5+ years of experience in site reliability or DevOps roles, preferably in cloud‑native environments.
- Proficiency with Kubernetes, Docker, and container orchestration best practices.
- Hands‑on experience with Prometheus, Grafana, and alerting systems.
- Strong scripting skills in Python or Bash for automation and tooling.
- Experience with AWS services (EKS, CloudWatch, S3, etc.) and CI/CD pipelines.
Skills
Python, Kubernetes, Prometheus, Grafana, AWS, Docker