Remote / Onsite

Principal Site Reliability Engineer - Persistent Systems

Site Reliability Engineer

Lead the design and operation of highly available, scalable cloud services using Kubernetes, Docker, and AWS, driving automation, observability, and incident response excellence.

About the role

Key Responsibilities

Architect, deploy, and maintain production‑grade Kubernetes clusters and containerized workloads across AWS environments.
Design and implement CI/CD pipelines, infrastructure as code (Terraform), and automated testing to accelerate release cycles.
Establish and enforce SLOs, SLIs, and incident management processes, ensuring rapid detection, triage, and resolution of outages.
Collaborate with development, security, and product teams to embed reliability best practices into the software development lifecycle.
Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce MTTR and prevent recurrence.

Requirements

10+ years of experience in large‑scale distributed systems and cloud operations.
Deep expertise in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
Proficiency with Terraform, Git, Jenkins/ArgoCD, and monitoring tools (Prometheus, Grafana, ELK).
Strong scripting skills in Python or Go and a solid understanding of networking, security, and compliance.
Excellent communication, mentorship, and problem‑solving abilities in a fast‑paced environment.

Skills

Kubernetes, Docker, AWS, Terraform, CI/CD, Python

Company: Persistent Systems

Department: Engineering

Location: Karnataka, India

Experience: 9+ years

Tenure: Full-time

Level: Lead

Posted June 27, 2026