remote

Principal Site Reliability Engineer - DigiCert

Site Reliability Engineer

Lead the Platform Ops team to design, build, and operate highly available, scalable cloud infrastructure using Kubernetes, AWS, and Terraform, driving automation, observability, and incident response for a global trust platform.

About the role

Key Responsibilities

Architect and maintain a resilient, scalable cloud platform on AWS, using Kubernetes for container orchestration and Terraform for infrastructure as code.
Drive automation of deployment pipelines, configuration management, and monitoring to reduce manual toil and improve reliability.
Lead incident response, post‑mortem analysis, and continuous improvement initiatives to enhance system uptime and performance.
Collaborate with cross‑functional teams to define SLIs and SLOs, ensuring alignment with business objectives.
Mentor and coach junior SREs, fostering a culture of learning, ownership, and proactive problem solving.

Requirements

10+ years of experience in site reliability engineering or related roles, with a strong background in cloud-native technologies.
Proficiency in AWS, Kubernetes, Terraform, and CI/CD tooling (GitHub Actions, Jenkins, ArgoCD).
Deep understanding of monitoring, logging, and alerting systems (Prometheus, Grafana, ELK, Datadog).
Excellent troubleshooting skills, with a track record of resolving complex production incidents.
Strong communication and leadership abilities, capable of influencing technical direction across teams.

Skills

kubernetes, aws, terraform, cicd

Company: DigiCert

Department: Engineering

Location: Lehi, Utah, United States

Experience: 7+ years

Tenure: full-time

Level: Lead

Posted June 24, 2026