We are building a European AI Infrastructure & Platform Operations team responsible for operating large-scale AI infrastructure environments powered by NVIDIA GPUs, high-performance networking, Kubernetes, and next-generation platform technologies.
The team ensures the availability, performance, and operational stability of critical AI infrastructure platforms deployed across multiple datacenters. Working at the intersection of infrastructure, networking, and platform operations, you will support the environments that power modern AI workloads.
This is an opportunity to work with some of the latest technologies in AI infrastructure while contributing to the evolution of AI-powered operational services through platforms such as k0rdent AI.
Responsibilities:
- Monitor, operate, and support production AI infrastructure platforms.
- Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
- Support NVIDIA GPU infrastructure and associated platform services.
- Monitor and troubleshoot Kubernetes-based environments.
- Investigate performance, availability, and reliability issues across infrastructure and platform components.
- Collaborate with engineering teams, hardware vendors, datacenter personnel, and service delivery teams to resolve technical issues.
- Participate in incident response, root cause analysis, and operational improvement activities.
- Contribute to improvements in monitoring, observability, automation, and operational processes.
- Maintain operational documentation, runbooks, and knowledge articles.
Requirements:
- 3+ years of experience in infrastructure operations, platform operations, network operations, site reliability engineering, cloud operations, datacenter operations, or related technical roles.
- Strong Linux administration and troubleshooting skills.
- Good understanding of networking concepts and experience diagnosing infrastructure-related issues.
- Working knowledge of Kubernetes in production environments.
- Experience supporting production infrastructure and services.
- Strong analytical and problem-solving skills.
- Experience working within structured operational and incident management processes.
- Excellent communication and collaboration skills.
- Ability to work within a shift-based operational environment.
Experience in one or more of the following areas is highly desirable:
- NVIDIA GPU infrastructure and accelerated computing platforms.
- InfiniBand networking and NVIDIA UFM.
- Kubernetes platform operations.
- AI infrastructure or HPC environments.
- Site Reliability Engineering (SRE) or Platform Engineering.
- Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry.
- Infrastructure automation technologies and Infrastructure-as-Code practi