remote
Cloud Reliability Engineer II - ThoughtSpot
Software Engineer
Senior Cloud Reliability Engineer responsible for the availability, security, and performance of a SaaS platform. The role uses AWS, Kubernetes, and automation tools to manage incidents, monitor systems, and drive continuous improvement.
About the role
Key Responsibilities
- Lead day‑to‑day operations for a production SaaS platform, ensuring 99.99% uptime and compliance with SLAs.
- Own incident lifecycle: triage, root‑cause analysis, post‑mortem documentation, and preventive actions.
- Design, implement, and maintain monitoring, alerting, and observability stacks (Prometheus, Grafana, CloudWatch).
- Automate repetitive tasks and infrastructure provisioning using IaC (Terraform, CloudFormation) and scripting (Python, Bash).
- Collaborate with security and compliance teams to enforce best practices and audit readiness.
- Drive continuous improvement initiatives, including capacity planning, cost optimization, and tracking of reliability metrics.
Requirements
- 5+ years of experience in cloud operations or SRE roles, preferably in a SaaS environment.
- Proficient with AWS services (EC2, RDS, S3, CloudWatch) and container orchestration (Kubernetes).
- Strong scripting skills in Python or Bash and experience with IaC tools.
- Excellent problem‑solving, communication, and collaboration abilities.
- Certifications such as AWS Certified DevOps Engineer - Professional or Certified Kubernetes Administrator (CKA) are a plus.