onsite
Senior Site Reliability Engineer - Clearwater Analytics (CWAN)
Site Reliability Engineer
Clearwater Analytics is seeking a Senior SRE to design, automate, and scale the infrastructure for a fast‑growing AI‑powered risk analytics platform, with a focus on Kubernetes, AWS, and observability tooling.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, automated infrastructure for multiple client deployments on AWS.
- Develop and enhance Kubernetes‑based platforms, including Helm charts, operators, and custom controllers.
- Create and manage infrastructure‑as‑code (IaC) pipelines using Terraform and CI/CD tools to ensure repeatable, version‑controlled provisioning.
- Build observability solutions with Prometheus, Grafana, and logging stacks to provide real‑time monitoring and alerting.
- Collaborate with development and client‑facing teams to translate incident learnings into permanent platform improvements.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, preferably in SaaS environments.
- Strong proficiency in Python for automation and scripting.
- Deep hands‑on experience with Kubernetes orchestration, Helm, and container lifecycle management.
- Expertise in AWS services (EC2, RDS, S3, IAM, VPC) and infrastructure as code using Terraform.
- Solid understanding of monitoring, alerting, and logging frameworks such as Prometheus, Grafana, and ELK.
Skills
Python, Kubernetes, Terraform, AWS, CI/CD, Prometheus