onsite
Senior Site Reliability Engineer - Arkestro
Lead the reliability and scalability of a data‑driven procurement platform, designing and maintaining cloud infrastructure, monitoring, and automation to ensure high availability and performance.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, highly available infrastructure on AWS using Terraform and CloudFormation.
- Manage containerized workloads with Kubernetes, ensuring efficient deployment, scaling, and rolling updates.
- Build and maintain an observability stack (Prometheus, Grafana, Loki) for real‑time monitoring, alerting, and incident response.
- Automate CI/CD pipelines with GitHub Actions and ArgoCD, enforcing best practices for code quality and deployment speed.
- Collaborate with development teams to embed reliability principles into application design and code reviews.
- Lead post‑mortem analyses and root cause investigations, and implement preventive measures to reduce MTTR.
Requirements
- 5+ years of SRE or DevOps experience in a production environment.
- Proficient with Kubernetes, Docker, and cloud-native tooling.
- Strong scripting skills in Python or Bash for automation.
- Hands‑on experience with AWS services (EC2, RDS, S3, CloudWatch, IAM).
- Solid understanding of monitoring, alerting, and incident management practices.
Skills
Kubernetes, Docker, AWS, Terraform, Prometheus, Grafana, Python