remote

Site Reliability Engineer - STN Inc

Senior Site Reliability Engineer responsible for defining SLOs, building observability stacks, and leading incident response for a GPU‑as‑a‑Service platform using Kubernetes, Prometheus, Grafana, Docker, and AWS.

About the role

Key Responsibilities

Define and enforce Service Level Objectives (SLOs) aligned with customer SLAs and contractual commitments.
Build, maintain, and evolve the observability stack, including metrics collection, alerting, and dashboards with Prometheus, Grafana, and related tooling.
Lead major incident investigations, root‑cause analysis, and post‑mortem documentation to drive continuous reliability improvements.
Collaborate with platform, security, and development teams to implement automation, capacity planning, and performance tuning.
Participate in on‑call rotations, ensuring rapid response and resolution of production incidents.

Requirements

5+ years of experience in site reliability or DevOps roles, preferably in cloud‑native environments.
Proficiency with Kubernetes, Docker, and container orchestration best practices.
Hands‑on experience with Prometheus, Grafana, and alerting systems.
Strong scripting skills in Python or Bash for automation and tooling.
Experience with AWS services (EKS, CloudWatch, S3, etc.) and CI/CD pipelines.

Skills

Python, Kubernetes, Prometheus, Grafana, AWS, Docker

Company: STN Inc

Department: Engineering

Location: San Francisco Bay Area, California, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026