onsite

Senior Site Reliability Engineer - Arkestro


Lead the reliability and scalability of a data‑driven procurement platform, designing and maintaining cloud infrastructure, monitoring, and automation to ensure high availability and performance.

About the role

Key Responsibilities

Design, implement, and maintain scalable, highly available infrastructure on AWS using Terraform and CloudFormation.
Manage containerized workloads with Kubernetes, ensuring efficient deployment, scaling, and rolling updates.
Build and maintain an observability stack (Prometheus, Grafana, Loki) for real‑time monitoring, alerting, and incident response.
Automate CI/CD pipelines with GitHub Actions and ArgoCD, enforcing best practices for code quality and deployment speed.
Collaborate with development teams to embed reliability principles into application design and code reviews.
Lead post‑mortem analyses and root cause investigations, and implement preventive measures to reduce MTTR.

Requirements

5+ years of SRE or DevOps experience in a production environment.
Proficient with Kubernetes, Docker, and cloud-native tooling.
Strong scripting skills in Python or Bash for automation.
Hands‑on experience with AWS services (EC2, RDS, S3, CloudWatch, IAM).
Solid understanding of monitoring, alerting, and incident management practices.

Skills

Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, Python

Company: Arkestro

Department: Engineering

Location: United States

Experience: 5+ years

Tenure: Full-time

Level: Senior

Posted June 27, 2026