Founding Senior AI Platform / SRE Engineer
As a Founding Senior AI Platform / SRE Engineer, you will be responsible for ensuring the reliability, deployment, and operational efficiency of Naiian's AI platform. This role involves building robust infrastructure, implementing comprehensive observability, and managing the cost and performance of AI workloads, with a strong emphasis on operational ownership and strategic decision-making.
Naiian is a well-funded European Deep Tech startup with a team in Madrid, founded by experienced professionals in product, applied AI, and engineering in critical environments. We build for clients operating in contexts of high operational and decisional demands, where auditability, integration with verifiable sources, and human approval mechanisms for sensitive tasks are not just features, but the foundation.
We are in a foundational phase. The people joining now will define the architecture, code, and technical culture the company will inherit for years to come.
You will ensure that the infrastructure, deployments, and AI serving layer continue to function when the product is no longer a demo. You will be responsible for reliability, deployment, observability, capacity management, autoscaling, recovery, CI/CD, cost engineering, model serving support, and the foundations of operational model routing.
The reason this role exists is concrete: in an AI platform, reliability is not just availability. It is also about controlling latency, cost, rate limits, provider failures, timeouts, worker saturation, capacity planning, inference observability, and fallback routes. If any of these elements are ignored, the platform will fail or become unaffordable — and both problems kill startups.
Part of the job is to operate managed APIs with discretion, and to pave the way for using self-hosted open-weight models where it makes sense. You don't have to be a low-level GPU expert, but you do need to understand how to operate AI workloads with reliability, cost, and portability.
We work in-person in Madrid. It's a conscious decision: in the foundational phase, the speed of iteration and the quality of technical decisions made on a shared whiteboard are difficult to replicate remotely.
We operate with little process and a lot of responsibility. Whoever deploys a system also operates it. Whoever defines an SLO is also responsible when it breaks. We do not treat infrastructure as a separate layer of the product — it is part of the product, with its own owners, metrics, and trade-offs.
The quality criterion is set by reality: Can it handle real load? Can it be debugged under pressure? Does it control cost, not just availability? Could we change providers without rewriting everything? If the answer to any is “no,” go back to the drawing board.
More than a rigid profile, we are looking for a set of demonstrable competencies:
Posted June 11, 2026