Remote, Onsite
Principal Site Reliability Engineer - Persistent Systems
Site Reliability Engineer
Lead the design and operation of highly available, scalable cloud services on Kubernetes, Docker, and AWS, with a focus on automation, observability, and incident response.
About the role
Key Responsibilities
- Architect, deploy, and maintain production‑grade Kubernetes clusters and containerized workloads across AWS environments.
- Design and implement CI/CD pipelines, infrastructure as code (Terraform), and automated testing to accelerate release cycles.
- Establish and enforce SLOs, SLIs, and incident management processes, ensuring rapid detection, triage, and resolution of outages.
- Collaborate with development, security, and product teams to embed reliability best practices into the software development lifecycle.
- Lead post‑mortem analyses, root‑cause investigations, and continuous improvement initiatives to reduce MTTR and prevent recurrence.
Requirements
- 10+ years of experience in large‑scale distributed systems and cloud operations.
- Deep expertise in Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Proficiency with Terraform, Git, Jenkins/ArgoCD, and monitoring tools (Prometheus, Grafana, ELK).
- Strong scripting skills in Python or Go and a solid understanding of networking, security, and compliance.
- Excellent communication, mentorship, and problem‑solving abilities in a fast‑paced environment.
Skills
Kubernetes, Docker, AWS, Terraform, CI/CD, Python