Onsite
Senior Systems Reliability Engineer I - ThoughtSpot
Software Engineer
Senior SRE who blends deep systems expertise with AI/ML‑driven operational intelligence, delivering reliable, self‑optimizing services while serving as the primary technical liaison for customers.
About the role
Key Responsibilities
- Own end‑to‑end reliability of cloud‑native services, including monitoring, alerting, and incident response.
- Develop and maintain automation scripts and tooling (Python, Terraform, CI/CD pipelines) to reduce manual toil.
- Collaborate with product and engineering teams to design resilient architectures on AWS and Kubernetes.
- Leverage machine‑learning models to predict failures, generate proactive recommendations, and improve service health.
- Act as the primary technical point of contact for customers, translating their needs into actionable engineering solutions.
Requirements
- 5+ years of experience in site reliability, DevOps, or production engineering.
- Strong proficiency in Python and automation scripting.
- Hands‑on experience with Kubernetes, container orchestration, and AWS services.
- Familiarity with observability stacks such as Prometheus, Grafana, and log aggregation tools.
- Demonstrated ability to apply machine‑learning techniques for predictive operations and to communicate complex technical concepts to customers.
Skills
Python, Kubernetes, AWS, Prometheus, machine learning