Remote
IT Principal Operations Engineer - Save A Lot
Systems Engineer
Lead enterprise operations engineering, driving performance optimization, automation, and incident resolution across cloud and on‑prem environments using Linux, AWS, Python, and modern monitoring tools.
About the role
Key Responsibilities
- Architect, deploy, and maintain scalable infrastructure on AWS, ensuring high availability and cost efficiency.
- Develop and manage automation pipelines with Ansible and Terraform to streamline configuration and provisioning.
- Implement and tune monitoring solutions (Prometheus, Grafana) to proactively detect and resolve performance bottlenecks.
- Lead incident response, root‑cause analysis, and post‑mortem documentation to continuously improve service reliability.
- Collaborate with cross‑functional teams to define and enforce service‑level agreements and operational best practices.
Requirements
- 5+ years of experience in systems engineering or operations roles, with a focus on cloud and Linux environments.
- Proficiency in scripting (Python, Bash) and automation tools (Ansible, Terraform).
- Strong knowledge of AWS services (EC2, RDS, S3, CloudWatch) and networking concepts.
- Hands‑on experience with monitoring and alerting platforms such as Prometheus and Grafana.
- Excellent problem‑solving skills, with a track record of driving process improvements and incident resolution.
Skills
Linux, AWS, Python, Ansible, Prometheus