Remote
Senior Site Reliability Engineer - DigiCert
Site Reliability Engineer
The Senior SRE drives reliability, scalability, and performance for cloud‑native services, using Kubernetes, AWS, and Terraform, with automation built on Python and CI/CD pipelines.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS using Kubernetes, Docker, and Terraform.
- Develop and own monitoring, alerting, and observability solutions with Prometheus, Grafana, and custom Python scripts.
- Collaborate with development teams to embed reliability best practices into CI/CD pipelines and application code.
- Automate routine operational tasks, incident response, and post‑mortem analysis to continuously improve system resilience.
- Lead capacity planning, performance tuning, and disaster‑recovery testing for mission‑critical services.
Requirements
- 5+ years of experience in site reliability or DevOps roles, with deep expertise in Kubernetes, Docker, and AWS services.
- Proficiency in infrastructure‑as‑code tools such as Terraform and strong scripting skills in Python or Bash.
- Hands‑on experience with monitoring stacks (Prometheus, Grafana) and building robust CI/CD pipelines (Jenkins, GitLab CI, or similar).
- Solid understanding of Linux systems, networking, and security best practices.
- Track record of driving reliability improvements, incident management, and automation at scale.
Skills
kubernetes, docker, aws, terraform, python, prometheus, cicd, linux