remote

Principal Site Reliability Engineer - Uniphore

Site Reliability Engineer

Lead the design and operation of highly available, scalable AI‑native services on AWS, driving automation, observability, and resilience for enterprise‑grade deployments.

About the role

Key Responsibilities

Architect, build, and maintain production‑grade infrastructure for AI‑native services using Kubernetes, AWS, and Terraform.
Design and implement CI/CD pipelines that enable rapid, reliable releases of microservices and AI models.
Develop and enforce observability standards—metrics, logs, traces—to detect, diagnose, and remediate incidents at scale.
Collaborate with cross‑functional teams to define SLOs, SLAs, and runbooks, ensuring continuous improvement of reliability.
Lead incident response, post‑mortem analysis, and root‑cause investigations, driving systemic changes.

Requirements

10+ years of experience in site reliability or DevOps roles, with a proven track record in large‑scale, cloud‑native environments.
Deep expertise in Kubernetes, AWS services (EKS, EC2, S3, CloudWatch), and infrastructure as code (Terraform).
Strong background in CI/CD tooling (GitHub Actions, Jenkins, ArgoCD) and scripting (Python, Bash).
Hands‑on experience with observability stacks (Prometheus, Grafana, Loki, OpenTelemetry).
Excellent communication skills and a collaborative mindset for working with engineering, product, and operations teams.

Skills

Kubernetes, AWS, CI/CD, Terraform

Company: Uniphore

Department: Engineering

Location: Palo Alto, CA, United States

Experience: 7+ years

Tenure: Full-time

Level: Lead

Salary: 335,811

Posted: June 19, 2026