Remote
Principal Site Reliability Engineer - Saviynt
Site Reliability Engineer
Lead the design, build, and operation of scalable, reliable infrastructure services for a global SaaS platform, driving automation, observability, and incident response excellence using Kubernetes, AWS, Terraform, and advanced CI/CD practices.
About the role
Key Responsibilities
- Architect and maintain shared infrastructure services that support product and application teams across a mission‑critical SaaS platform.
- Design and implement scalable, highly available Kubernetes clusters and associated tooling on AWS.
- Develop and enforce IaC standards using Terraform, ensuring repeatable, auditable deployments.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve reliability.
- Collaborate with security and compliance teams to embed best practices into all infrastructure components.
- Mentor and guide SRE and DevOps teams, fostering a culture of automation, observability, and proactive problem‑solving.
Requirements
- 10+ years of experience in site reliability engineering or related roles, with a strong focus on cloud-native environments.
- Deep expertise in Kubernetes, AWS services (EKS, EC2, S3, CloudWatch), and Terraform.
- Proven track record of building robust CI/CD pipelines and implementing comprehensive monitoring/alerting solutions.
- Strong incident‑management skills and experience with post‑mortem documentation.
- Excellent communication and leadership abilities, with a passion for mentoring junior engineers.
Skills
Kubernetes, AWS, Terraform, CI/CD