Remote
Lead Application Support Engineer (SRE) - Marriott Vacations Worldwide
Site Reliability Engineer
Lead SRE responsible for designing, operating, and improving highly available applications using AWS, Linux, Python automation, Terraform infrastructure as code (IaC), and Prometheus monitoring.
About the role
Key Responsibilities
- Design, implement, and maintain scalable, reliable infrastructure for critical applications on AWS.
- Develop and maintain automation scripts and tools using Python and Bash to streamline deployment and incident response.
- Manage infrastructure as code with Terraform, ensuring version‑controlled, repeatable environments.
- Monitor system health and performance using Prometheus, Grafana, and alerting pipelines; lead on‑call rotations and incident triage.
- Collaborate with development and product teams to define SLAs, SLOs, and capacity planning strategies.
Requirements
- 5+ years of experience in Site Reliability Engineering or production support for large‑scale web applications.
- Strong expertise in Linux system administration and networking.
- Proficiency with AWS services (EC2, RDS, S3, Lambda, etc.) and cloud‑native architecture.
- Hands‑on experience with Terraform or similar IaC tools.
- Solid scripting skills in Python (or Bash) and familiarity with monitoring stacks such as Prometheus/Grafana.
Skills
Linux, AWS, Python, Terraform, Prometheus