onsite

Founding Senior AI Platform / SRE Engineer

As a Founding Senior AI Platform / SRE Engineer, you will be responsible for ensuring the reliability, deployment, and operational efficiency of Naiian's AI platform. This role involves building robust infrastructure, implementing comprehensive observability, and managing the cost and performance of AI workloads, with a strong emphasis on operational ownership and strategic decision-making.

About the role

About Naiian

Naiian is a well-funded European Deep Tech startup with a team in Madrid, founded by experienced professionals in product, applied AI, and engineering for critical environments. We build for clients operating under high operational and decision-making demands, where auditability, integration with verifiable sources, and human approval mechanisms for sensitive tasks are not just features, but the foundation.

We are in a foundational phase. The people joining now will define the architecture, code, and technical culture the company will inherit for years to come.

The Role and Why It Exists

You will ensure that the infrastructure, deployments, and AI serving layer continue to function when the product is no longer a demo. You will be responsible for reliability, deployment, observability, capacity management, autoscaling, recovery, CI/CD, cost engineering, model serving support, and the foundations of operational model routing.

The reason this role exists is concrete: in an AI platform, reliability is not just availability. It is also about controlling latency, cost, rate limits, provider failures, timeouts, worker saturation, capacity planning, inference observability, and fallback routes. If any of these elements is ignored, the platform will fail or become unaffordable, and either problem kills a startup.

Part of the job is to operate managed APIs with good judgment, and to pave the way for using self-hosted open-weight models where it makes sense. You don't have to be a low-level GPU expert, but you do need to understand how to operate AI workloads with reliability, cost, and portability in mind.

What You Will Build in the First 6 Months

Reproducible development, staging, and production environments, with robust CI/CD, IaC, and rollback readiness from the start.
End-to-end observability: logs, metrics, traces, alerts, and operational dashboards that allow for debugging real incidents, not just decorating screens.
Initial SLOs, serious on-call, incident response, and postmortems — the operational foundation before beta.
Autoscaling, capacity planning, and cost controls. Availability without cost and latency control is not enough in an AI platform.
The foundations of inference endpoints, model serving, and operational model routing — including fallback routes in case of provider failure or saturation.
Functional metrics for AI workloads: by model, provider, tenant, task, cost, and latency. Without these metrics, there is no way to manage cost or quality.
Clear separation between control plane and inference plane, in coordination with the founding team.

How We Work

We work in-person in Madrid. It's a conscious decision: in the foundational phase, the speed of iteration and the quality of technical decisions made on a shared whiteboard are difficult to replicate remotely.

We operate with little process and a lot of responsibility. Whoever deploys a system also operates it. Whoever defines an SLO is also responsible when it breaks. We do not treat infrastructure as a separate layer of the product — it is part of the product, with its own owners, metrics, and trade-offs.

The quality criterion is set by reality: Can it handle real load? Can it be debugged under pressure? Does it control cost, not just availability? Could we change providers without rewriting everything? If the answer to any of these is “no,” we go back to the drawing board.

What We Are Looking For

More than a rigid profile, we are looking for a set of demonstrable competencies:

Real track record of production ownership — you have managed serious incidents, made decisions under pressure, and can explain what you learned.
Solid CI/CD and IaC skills (Terraform, Pulumi, or equivalents). Reproducible deployments and rollback readiness, not “it works on my machine.”
True observability: logs, metrics, traces, alerts. Ability to debug under pressure with tools like OpenTelemetry, Datadog, Grafana/Prometheus, or equivalents.
Capacity planning and cost engineering — you understand that reliability without cost control is not sustainable reliability.
Comfortable with AWS, Docker, and container orchestration (Kubernetes on EKS, or ECS) or equivalents. Basic networking, IAM, secrets, hardening.
Practical understanding of model serving and AI workload operation, even if you are not a low-level GPU expert.
Judgment on when to use managed APIs and when to prepare self-hosted — you understand the trade-offs of cost, latency, reliability, privacy, and portability.
Professional-level Spanish, required by the nature of the position, and working English for collaborating in a bilingual team.

Bonus Points If

You have hands-on experience with vLLM, SGLang, Triton, TGI, Ray Serve, KServe, SageMaker, Bedrock, or equivalent model serving solutions.
You have worked with GPU workloads, inference, batch processing, or high-throughput systems in production.
You have operated multi-tenant systems with sensitive data or compliance requirements.
You have experience with FinOps or cloud cost engineering — not just monitoring, but acting on cost.
You come from fintech, enterprise SaaS, data platforms, or high-load systems.
