onsite
Site Reliability Engineer (SRE) / Platform Engineer - Codevian Technologies Pvt Ltd
Site Reliability Engineer
About the role
Lead the design, deployment, and operation of scalable AWS infrastructure, driving Kubernetes (EKS) adoption and GitOps automation to deliver resilient, high‑availability services.
Key Responsibilities
- Design, provision, and maintain large‑scale AWS environments (EKS, EC2, RDS Aurora, ElastiCache, Control Tower) to support production workloads.
- Lead Kubernetes (multi‑cluster, multi‑environment) migration and day‑to‑day operations, ensuring high availability and performance.
- Implement Infrastructure as Code with Terraform, integrating GitOps tools such as ArgoCD, Atlantis, and custom pipelines for automated, repeatable deployments.
- Define and enforce SLI/SLO frameworks, enhancing observability, monitoring, and incident response across the platform.
- Collaborate with development teams to embed reliability best practices, including node auto‑scaling with Karpenter and event‑driven workload scaling via KEDA.
Requirements
- 5+ years of experience in cloud operations, with deep expertise in AWS and Kubernetes.
- Proficient in Terraform, GitOps workflows, and CI/CD tooling (ArgoCD, Atlantis).
- Strong scripting skills (Python, Bash) and familiarity with monitoring/alerting stacks (Prometheus, Grafana, Alertmanager).
- Hands‑on experience with Karpenter, KEDA, and container orchestration best practices.
- Excellent problem‑solving, communication, and collaboration abilities in a fast‑paced environment.
Skills
AWS, Kubernetes, Terraform