Remote
Lead Cloud Operations & SRE Engineer - UST
Site Reliability Engineer
Lead the Cloud Operations and Site Reliability Engineering function, defining technical strategy, automating infrastructure, and ensuring high‑availability services on AWS using Kubernetes, Terraform, CI/CD pipelines, and advanced monitoring.
About the role
Key Responsibilities
- Define and drive the technical roadmap for cloud operations and SRE across multiple production environments.
- Architect, implement, and maintain highly available, scalable infrastructure on AWS using Kubernetes and Terraform, following infrastructure‑as‑code (IaC) best practices.
- Design, build, and optimize CI/CD pipelines to enable rapid, reliable deployments.
- Implement comprehensive monitoring, alerting, and incident response processes to achieve SLO/SLA targets.
- Mentor and lead a team of engineers, fostering a culture of automation, reliability, and continuous improvement.
Requirements
- 5+ years of hands‑on experience in cloud operations, site reliability engineering, or related roles.
- Deep expertise with AWS services, Kubernetes orchestration, and infrastructure‑as‑code tools such as Terraform.
- Proficiency in scripting or programming (e.g., Python) for automation and tooling.
- Strong background in CI/CD pipeline creation, monitoring solutions (Prometheus, Grafana, CloudWatch), and incident management.
- Demonstrated leadership ability to guide technical teams and influence cross‑functional stakeholders.
Skills
AWS, Kubernetes, Terraform, CI/CD, Python