remote
Director - Hyperscale, HPC & Sovereign AI Deployment and Fleet Operations - AMD
Systems Engineer
Lead the design and execution of hyperscale AI and HPC deployments, driving fleet operations and cloud infrastructure to accelerate next‑generation AI workloads across data centers and edge environments.
About the role
Key Responsibilities
- Architect and oversee hyperscale AI and HPC deployment strategies, ensuring high availability, performance, and security across global data centers.
- Lead cross‑functional teams in the design, implementation, and optimization of AI infrastructure, including GPU clusters, networking, and storage solutions.
- Develop and maintain fleet operations processes that automate provisioning, monitoring, and lifecycle management of AI workloads using Kubernetes and cloud‑native tools.
- Collaborate with product, research, and security teams to integrate cutting‑edge AI models and ensure compliance with data sovereignty regulations.
- Drive continuous improvement initiatives, leveraging metrics and analytics to enhance system efficiency, cost‑effectiveness, and scalability.
Requirements
- 10+ years of experience in large‑scale AI or HPC infrastructure, with a proven track record of leading complex deployments.
- Deep expertise in cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes, OpenShift).
- Strong understanding of GPU architecture, high‑performance networking, and storage technologies.
- Excellent leadership, communication, and stakeholder management skills.
- Experience with sovereign AI compliance and data residency requirements is a plus.