Remote
Senior Site Reliability Engineer - BillingPlatform
Site Reliability Engineer
Lead the reliability and scalability of a cloud‑native SaaS platform, driving automation, performance, and uptime using Kubernetes, AWS, Terraform, and advanced monitoring tools.
About the role
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure for a global SaaS revenue lifecycle platform on AWS.
- Automate deployment pipelines with CI/CD, Terraform, and GitOps practices to ensure rapid, reliable releases.
- Implement and manage Kubernetes clusters, ensuring optimal resource utilization, security, and resilience.
- Develop and maintain an observability stack (Prometheus, Grafana, Loki, etc.) for real‑time monitoring, alerting, and capacity planning.
- Lead incident response, root‑cause analysis, and post‑mortem processes to continuously improve system reliability.
- Collaborate with development, security, and product teams to embed SRE principles across the organization.
Requirements
- 5+ years of experience in site reliability or DevOps roles, preferably in SaaS environments.
- Deep expertise with AWS services (EC2, RDS, S3, EKS, CloudWatch) and Kubernetes cluster management.
- Proficiency in infrastructure as code with Terraform and in automating CI/CD pipelines.
- Strong scripting skills in Python or Bash for automation and tooling.
- Experience with monitoring, logging, and alerting solutions (Prometheus, Grafana, Loki, ELK).
Skills
Kubernetes, AWS, Terraform, CI/CD