We are building a European AI Infrastructure & Platform Operations team responsible for operating large-scale AI infrastructure environments powered by NVIDIA GPUs, high-performance networking, Kubernetes, and next-generation platform technologies.
As a Senior AI Infrastructure & Platform Operations Engineer, you will serve as a technical leader within the operations organization, providing deep expertise across infrastructure, networking, platform operations, and service reliability. You will drive operational excellence across complex production environments and act as a key escalation point for critical incidents and difficult technical issues.
This role combines hands-on operations with technical leadership. You will help shape operational standards, reliability practices, automation initiatives, and the evolution of AI-powered operational services through platforms such as k0rdent AI.
Responsibilities:
Technical Operations & Service Reliability
- Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
- Act as a senior escalation point for operational teams during critical service-impacting events.
- Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
- Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
- Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
- Lead root cause analysis and drive long-term corrective actions.
- Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
- Participate in major incident management and service restoration activities.
Platform Operations & Engineering
- Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
- Drive improvements in platform reliability, observability, monitoring, and operational processes.
- Identify opportunities to automate repetitive operational tasks and improve efficiency.
- Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
- Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
- Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
Technical Leadership
- Mentor and support AI Infrastructure & Platform Operations Engineers.
- Share technical knowledge through documentation, training sessions, and operational reviews.
- Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.