Remote
Senior Site Reliability Engineer - Cvent
Site Reliability Engineer
The Senior Site Reliability Engineer designs, automates, and operates highly available cloud infrastructure, using Kubernetes, AWS, Terraform, and monitoring tools such as Prometheus and Grafana to keep services reliable and performant at scale.
About the role
Key Responsibilities
- Design, build, and maintain scalable, highly available services on AWS using infrastructure‑as‑code (Terraform) and container orchestration (Kubernetes).
- Develop automation scripts and tools in Python and Go to streamline deployment, configuration, and incident response workflows.
- Implement robust monitoring, alerting, and observability solutions with Prometheus, Grafana, and logging pipelines to proactively detect and resolve issues.
- Collaborate with development and product teams to define SLOs/SLIs, conduct capacity planning, and drive performance optimizations.
- Lead on‑call rotations, perform root‑cause analysis, and create post‑mortem documentation to continuously improve reliability.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, with a strong focus on cloud platforms (AWS) and container orchestration (Kubernetes).
- Proficiency in scripting/programming languages such as Python and Go.
- Hands‑on experience with infrastructure‑as‑code tools (Terraform, CloudFormation) and CI/CD pipelines (Jenkins, GitLab CI, or similar).
- Deep understanding of Linux systems, networking, and performance tuning.
- Experience with monitoring and observability stacks (Prometheus, Grafana, ELK/EFK) and a track record of implementing SLO/SLI frameworks.
Skills
Python, Go, Kubernetes, AWS, Terraform, Prometheus, CI/CD, Linux