Remote
Site Reliability Engineer II - Medallia
Site Reliability Engineer
The Site Reliability Engineer II designs, deploys, and maintains highly available SaaS infrastructure on AWS, using Kubernetes, Docker, and Terraform to ensure performance, reliability, and rapid incident resolution.
About the role
Key Responsibilities
- Design, implement, and manage scalable, highly available Kubernetes clusters on AWS to support Medallia’s SaaS platform.
- Automate infrastructure provisioning and configuration using Terraform, ensuring repeatable and auditable deployments.
- Develop and maintain CI/CD pipelines, integrating automated testing, security scanning, and blue‑green deployments.
- Monitor system health with Prometheus, Grafana, and CloudWatch; respond to incidents, conduct post‑mortems, and implement preventive measures.
- Collaborate with development teams to optimize application performance, capacity planning, and cost efficiency.
Requirements
- 3+ years of SRE or DevOps experience in a cloud‑native environment.
- Proficiency with Kubernetes, Docker, and AWS services (EKS, EC2, RDS, S3).
- Hands‑on experience with Terraform, Helm, and CI/CD tools (GitHub Actions, Jenkins, Argo CD).
- Strong scripting skills in Bash or Python and familiarity with monitoring/alerting tools.
- Excellent problem‑solving, communication, and collaboration abilities.
Skills
Kubernetes, Docker, AWS, Terraform