Remote
Senior Site Reliability Engineer - GoGuardian
Site Reliability Engineer
The Senior Site Reliability Engineer designs, scales, and automates cloud infrastructure and ensures the high availability and performance of education technology services, using Kubernetes, AWS, Terraform, and modern monitoring tools.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on AWS supporting critical K‑12 learning platforms.
- Develop and manage container orchestration using Kubernetes, including deployment pipelines, service mesh, and autoscaling strategies.
- Automate infrastructure provisioning and configuration management with Terraform and Python scripts.
- Implement robust monitoring, alerting, and observability solutions using Prometheus, Grafana, and related tooling.
- Collaborate with development and product teams to improve CI/CD processes, reduce incident response times, and drive reliability best practices.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, preferably in SaaS or education technology environments.
- Deep expertise with AWS services (EC2, RDS, S3, Lambda, etc.) and Kubernetes orchestration.
- Proficiency in infrastructure-as-code tools such as Terraform and scripting languages like Python.
- Strong background in monitoring, logging, and alerting frameworks (Prometheus, Grafana, ELK/EFK stacks).
- Experience building and maintaining CI/CD pipelines using tools like Jenkins, GitHub Actions, or CircleCI.
Skills
Kubernetes, AWS, Terraform, Python, Prometheus, CI/CD