onsite
Site Reliability Engineer II - Pros.
Site Reliability Engineer
Senior Site Reliability Engineer focused on building resilient, scalable infrastructure for AI‑driven airline retail platforms, using Kubernetes, Docker, AWS, and Terraform to ensure high availability and performance.
About the role
Key Responsibilities
- Design, implement, and maintain highly available Kubernetes clusters that support real‑time pricing and merchandising services.
- Automate infrastructure provisioning and configuration using Terraform and CI/CD pipelines to accelerate feature delivery.
- Monitor system health with Prometheus and Grafana, proactively identifying and resolving performance bottlenecks.
- Collaborate with development teams to embed reliability best practices into application code and deployment workflows.
- Lead incident response, root cause analysis, and post‑mortem documentation to continuously improve system resilience.
Requirements
- 5+ years of experience in site reliability or DevOps roles within high‑traffic, data‑intensive environments.
- Proficiency with Kubernetes, Docker, and cloud platforms (AWS preferred).
- Strong scripting skills (Python, Bash) and experience with IaC tools like Terraform.
- Hands‑on experience with monitoring, alerting, and log aggregation (Prometheus, Grafana, ELK).
- Excellent problem‑solving abilities and a collaborative mindset.
Skills
Kubernetes, Docker, AWS, Terraform