Remote / Onsite
Site Reliability Engineer - 66degrees
The Senior Site Reliability Engineer is responsible for designing, deploying, and maintaining highly available cloud-native infrastructure using Kubernetes, Docker, and AWS. The role drives automation, observability, and incident response to ensure seamless application delivery.
About the role
Key Responsibilities
- Design, implement, and manage scalable Kubernetes clusters on AWS, ensuring high availability and performance.
- Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines.
- Implement monitoring, alerting, and logging with Prometheus, Grafana, and the ELK stack to maintain system health.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to improve reliability.
- Collaborate with development teams to embed SRE best practices into the application lifecycle.
Requirements
- 5+ years of experience in site reliability or DevOps roles.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, S3, CloudWatch).
- Strong scripting skills (Python, Bash) and experience with Terraform.
- Hands‑on experience with CI/CD tools (GitHub Actions, Jenkins, ArgoCD).
- Excellent problem‑solving skills and a proactive approach to automation and reliability.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, CI/CD