onsite
Site Reliability Engineer - Clearwater Analytics (CWAN)
As a Site Reliability Engineer, you will build automated, observable, and standardized infrastructure for a rapidly scaling AI‑powered risk analytics platform, using Python, Kubernetes, Terraform, and AWS.
About the role
Key Responsibilities
- Design, implement, and maintain automated provisioning and configuration pipelines for client environments using Terraform and CI/CD tools.
- Develop and enhance observability solutions (metrics, logging, tracing) with Prometheus, Grafana, and related technologies to ensure high availability.
- Collaborate with development and client‑facing teams to translate incident learnings into permanent platform improvements.
- Manage and optimize Kubernetes clusters on AWS, ensuring scalability, security, and cost efficiency.
- Write production‑grade Python scripts and utilities to automate repetitive operational tasks.
Requirements
- 3+ years of experience in site reliability or DevOps roles, preferably in cloud‑native environments.
- Strong proficiency in Python for automation and tooling.
- Hands‑on experience with Kubernetes orchestration and AWS services (EC2, EKS, S3, IAM).
- Expertise in infrastructure‑as‑code using Terraform and CI/CD pipelines (Jenkins, GitHub Actions, or similar).
- Solid understanding of monitoring, alerting, and incident response practices and tooling (Prometheus, Grafana, PagerDuty).
Skills
Python, Kubernetes, Terraform, AWS, CI/CD, Prometheus