Remote
Platform Reliability Engineer - Manulife
Software Engineer
The Platform Reliability Engineer ensures high availability and performance of Azure Kubernetes Service platforms, drives operational efficiency, and manages incident response, compliance, and security across a multi‑vendor environment.
About the role
Key Responsibilities
- Maintain and enhance Azure Kubernetes Service (AKS) clusters, ensuring scalability, reliability, and optimal performance.
- Develop, document, and improve platform operational procedures and automation scripts.
- Own end‑to‑end incident lifecycle: triage, investigation, resolution, and post‑mortem analysis following ITIL best practices.
- Monitor platform compliance and security posture, implementing remediation actions and hardening measures.
- Collaborate with cross‑functional and multi‑vendor teams to resolve complex client and product issues.
- Identify process inefficiencies and drive continuous improvement initiatives.
Requirements
- Strong experience with Azure cloud services and Azure Kubernetes Service (AKS).
- Solid understanding of ITIL incident, problem, and change management processes.
- Hands‑on experience with cloud security, compliance frameworks, and vulnerability remediation.
- Proven ability to troubleshoot and resolve production incidents in a multi‑vendor environment.
- Excellent communication skills and a customer‑focused mindset for end‑to‑end issue ownership.
Skills
- Azure
- Kubernetes
- ITIL