Remote
Senior Cloud Site Reliability Engineer - Solace
Site Reliability Engineer
Senior SRE leading reliability for cloud‑native, event‑driven platforms. Design, automate, and operate scalable infrastructure on AWS using Kubernetes, Terraform, Python, and CI/CD pipelines while ensuring high availability and observability.
About the role
Key Responsibilities
- Design, build, and maintain highly available, scalable Kubernetes clusters on AWS for real‑time event streaming services.
- Automate infrastructure provisioning and configuration management using Terraform and Python scripts.
- Implement and manage CI/CD pipelines to enable rapid, reliable deployments and rollbacks.
- Develop comprehensive monitoring, logging, and alerting solutions to ensure service reliability and performance.
- Collaborate with development and product teams to define SLOs/SLAs and drive incident response and post‑mortem processes.
Requirements
- 5+ years of experience in site reliability or DevOps engineering, preferably in cloud‑native environments.
- Strong expertise with Kubernetes, AWS services (EKS, EC2, RDS, S3), and infrastructure‑as‑code tools such as Terraform.
- Proficiency in scripting/automation using Python and familiarity with CI/CD tools (Jenkins, GitLab CI, GitHub Actions).
- Hands‑on experience with observability stacks (Prometheus, Grafana, ELK, CloudWatch) and incident management.
- Solid understanding of networking, security, and high‑availability architectures for event‑driven systems.
Skills
Kubernetes, AWS, Terraform, Python, CI/CD