remote
Site Reliability Engineer - Bigstone Health Commission
Lead the design, deployment, and operation of scalable, highly available cloud services on Kubernetes, Docker, and AWS. Build the CI/CD pipelines and monitoring that keep those services reliable and performant.
About the role
Key Responsibilities
- Design, build, and maintain production-grade infrastructure on AWS, leveraging services such as EC2, EKS, RDS, and S3.
- Implement and manage Kubernetes clusters, ensuring high availability, autoscaling, and secure networking.
- Develop and maintain CI/CD pipelines with GitHub Actions, Jenkins, or Argo CD to automate application deployments and rollbacks.
- Monitor system health using Prometheus, Grafana, and CloudWatch; respond to incidents and conduct post-mortem analyses.
- Collaborate with development teams to enforce best practices for code quality, security, and observability.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles.
- Proficiency with Kubernetes, Docker, and AWS services.
- Strong scripting skills in Python or Bash for automation.
- Experience with CI/CD tooling and infrastructure as code (Terraform, CloudFormation).
- Excellent problem-solving skills and a proactive, collaborative mindset.
Skills
kubernetes, docker, aws, cicd, python