remote
Principal Site Reliability Engineer - DigiCert
Site Reliability Engineer
Lead the Platform Ops team to design, build, and operate highly available, scalable cloud infrastructure using Kubernetes, AWS, and Terraform, driving automation, observability, and incident response for a global trust platform.
About the role
Key Responsibilities
- Architect and maintain a resilient, scalable cloud platform on AWS, using Kubernetes for container orchestration and Terraform for infrastructure as code.
- Drive automation of deployment pipelines, configuration management, and monitoring to reduce manual toil and improve reliability.
- Lead incident response, post‑mortem analysis, and continuous improvement initiatives to enhance system uptime and performance.
- Collaborate with cross‑functional teams to define SLIs and SLOs, ensuring alignment with business objectives.
- Mentor and coach junior SREs, fostering a culture of learning, ownership, and proactive problem solving.
Requirements
- 10+ years of experience in site reliability engineering or related roles, with a strong background in cloud-native technologies.
- Proficiency in AWS, Kubernetes, Terraform, and CI/CD tooling (GitHub Actions, Jenkins, ArgoCD).
- Deep understanding of monitoring, logging, and alerting systems (Prometheus, Grafana, ELK, Datadog).
- Excellent troubleshooting skills, with a track record of resolving complex production incidents.
- Strong communication and leadership abilities, capable of influencing technical direction across teams.
Skills
Kubernetes, AWS, Terraform, CI/CD