Remote
Principal Site Reliability Engineer - Uniphore
Site Reliability Engineer
Lead the design and operation of highly available, scalable AI‑native services on AWS, driving automation, observability, and resilience for enterprise‑grade deployments.
About the role
Key Responsibilities
- Architect, build, and maintain production‑grade infrastructure for AI‑native services using Kubernetes, AWS, and Terraform.
- Design and implement CI/CD pipelines that enable rapid, reliable releases of microservices and AI models.
- Develop and enforce observability standards (metrics, logs, traces) to detect, diagnose, and remediate incidents at scale.
- Collaborate with cross‑functional teams to define SLOs, SLAs, and runbooks, ensuring continuous improvement of reliability.
- Lead incident response, post‑mortem analysis, and root‑cause investigations, and drive systemic fixes that prevent recurrence.
Requirements
- 10+ years of experience in site reliability or DevOps roles, with a proven track record in large‑scale, cloud‑native environments.
- Deep expertise in Kubernetes, AWS services (EKS, EC2, S3, CloudWatch), and infrastructure as code (Terraform).
- Strong background in CI/CD tooling (GitHub Actions, Jenkins, Argo CD) and scripting (Python, Bash).
- Hands‑on experience with observability stacks (Prometheus, Grafana, Loki, OpenTelemetry).
- Excellent communication skills and a collaborative mindset for working with engineering, product, and operations teams.
Skills
Kubernetes, AWS, CI/CD, Terraform