Agent-eval design is the new bottleneck role.

In the last six months, every frontier lab we work with as a hiring partner has opened at least one senior IC role with the phrase “agent eval” somewhere in the description. Most have opened more than one. The roles are well-defined, well-funded, and (this is the surprising part) very hard to fill. The supply of engineers who can actually do this work falls short of demand by the widest margin we’ve seen for any technical role since we started tracking the AI market. This piece is about why.

The short answer: agent eval design is research, not engineering, and it sits at an awkward intersection of three skill sets that don’t usually live in the same person. The long answer is more interesting, and worth working through if your trajectory has anything to do with agents over the next year or two.

// 01 Why every lab needs this now

The shape of frontier AI work has shifted noticeably in 2026. A year ago the most consequential research at the labs we track was post-training — RLHF, DPO, preference data, IFEval. Today the most consequential research is agent work: tool-use, multi-step planning, long-horizon execution, reliable function-calling. Some of this is product-pull (Claude Code, agent-platform launches, browser agents), some is research-pull (the math suggests agents are how you get utility out of the next generation of models). Either way, the shift is real, and the labs are organized around it now.

The problem is that all of the agent-research work the labs are doing depends on something they don’t have enough of: evals that actually catch the failure modes that matter in production. Without that, the team can’t tell whether the new tool-use policy is better than the old one, whether the new planner generalizes, whether last week’s release broke something subtle. Agent research without good agent eval is research without a feedback loop. The labs know this.

So they’re hiring. And they’re hitting the same problem that, frankly, we hit at the network when we tried to build our own agent-eval capability last year: the people who can do this well are rarer than the people who can do almost any other senior research role.

// 02 Why traditional model evals miss

The standard playbook from the post-training world doesn’t transfer cleanly. A traditional model eval looks like this: you have a frozen dataset of inputs, the model produces outputs, you score the outputs against ground truth, you get a number. The eval is reproducible, the math is clean, the rankings between models are stable enough to drive training decisions.
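For contrast with what follows, the frozen-dataset pattern fits in a few lines. This is a minimal sketch rather than any lab's harness: `frozen_eval`, the toy lookup-table "model", and exact-match scoring are all illustrative assumptions.

```python
from typing import Callable


def frozen_eval(model_fn: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Score a model on a fixed list of (input, ground_truth) pairs by exact match."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for prompt, truth in dataset if model_fn(prompt).strip() == truth)
    return correct / len(dataset)


# Hypothetical stand-in for a model: a lookup table with one wrong answer.
toy_answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
toy_dataset = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

score = frozen_eval(lambda prompt: toy_answers.get(prompt, ""), toy_dataset)
print(f"accuracy: {score:.3f}")
```

Everything the eval needs is in `dataset`: no tools, no state, no history. That is the property agents remove.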

Agents break every part of that pattern. The inputs aren’t frozen — they include tool definitions, environment state, conversation history, and a planner that can revise its own context. The outputs aren’t single-shot — they’re sequences of (sometimes hundreds of) tool calls, branches, retries, recoveries. There is no clean ground truth for “did the agent solve this task” because the task was open-ended and the trajectory through which the agent solved it (or failed) is itself part of what you care about.

And the failure modes are different. The taxonomy that agent-eval engineers actually use to organize their work doesn’t look anything like the eval landscape from the post-training world. Eight failure modes show up over and over in our conversations with eval engineers at frontier labs:

The eight production agent failure modes

// FROM 9 FRONTIER-LAB ENGINEERS WE TALKED TO · Q1 2026

// FAILURE 01

Schema drift

The tool’s signature changed; the agent’s call no longer parses. Hard to catch with frozen evals because the schemas evolve.

// FAILURE 02

Argument hallucination

The agent fabricates plausible-looking but incorrect arguments — a fake user_id, a date that doesn’t exist.

// FAILURE 03

Looping

The agent calls the same tool repeatedly with small variations, never making progress. Costs money and patience.

// FAILURE 04

Premature termination

The agent declares success after one tool call when the user’s actual request required three.

// FAILURE 05

Recovery failure

The first tool call errors. The agent doesn’t notice, or notices and gives up.

// FAILURE 06

Plan abandonment

Mid-execution, the agent forgets the original goal and pursues a sub-goal that became salient in context.

// FAILURE 07

Tool-choice drift

The agent picks the wrong tool for the request — usually a more general tool when a specific one is available.

// FAILURE 08

Context collapse

By step 30, the context window is full of prior tool outputs and the agent has lost track of the original user intent.

None of these are the kind of thing you catch with a frozen-dataset eval. They are all trajectory-level failures. To catch them, you need an eval that can simulate trajectories, which means you need a simulator, which means you need to maintain that simulator’s tools, which means the eval starts looking a lot more like a small distributed system than a static benchmark. That’s the work.
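To make "trajectory-level" concrete, here is a rough sketch of a looping check. It has to compare consecutive tool calls, which is information an end-state score never sees. The call format (`tool` and `args` keys), the `window` of three calls, and the 0.9 string-similarity threshold are assumptions chosen for illustration, not values from the engineers we talked to.

```python
import json
from difflib import SequenceMatcher


def near_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """True when two calls hit the same tool with almost identical arguments."""
    if a["tool"] != b["tool"]:
        return False
    args_a = json.dumps(a["args"], sort_keys=True)
    args_b = json.dumps(b["args"], sort_keys=True)
    return SequenceMatcher(None, args_a, args_b).ratio() >= threshold


def detect_looping(calls: list[dict], window: int = 3, threshold: float = 0.9) -> bool:
    """Flag `window` or more consecutive near-duplicate calls."""
    streak = 1
    for prev, cur in zip(calls, calls[1:]):
        streak = streak + 1 if near_duplicate(prev, cur, threshold) else 1
        if streak >= window:
            return True
    return False


looping_run = [
    {"tool": "search", "args": {"query": "quarterly revenue report 2025"}},
    {"tool": "search", "args": {"query": "quarterly revenue report 2025 pdf"}},
    {"tool": "search", "args": {"query": "quarterly revenue reports 2025 pdf"}},
]
progressing_run = [
    {"tool": "search", "args": {"query": "quarterly revenue report 2025"}},
    {"tool": "read_file", "args": {"path": "report.pdf"}},
    {"tool": "write_summary", "args": {"text": "Revenue grew."}},
]

print(detect_looping(looping_run), detect_looping(progressing_run))
```

A production check would likely need something smarter than string similarity (for example, comparing tool results as well as arguments), but the shape is the same: the signal lives in the sequence, not the final answer.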

// 03 What good agent-eval design looks like

The strong engineers we’ve placed into agent-eval roles do roughly the same five things in roughly the same order. None of them is impossible. Together they’re a coherent skill stack that takes most engineers six to nine months of focused work to develop, and most engineers don’t get the opportunity to do that focused work unless their current role specifically asks for it.

First, they build a real environment. Not a frozen test set — an environment with stateful tools, simulated APIs, and a way to seed and replay scenarios. This is more engineering than research, and it’s where most early agent-eval efforts stall: the team can build the agent but not a tractable place to test it.
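As a toy illustration of "stateful tools, simulated APIs, and a way to seed and replay", consider the sketch below. `ToyEnvironment`, its three tools, and the 30% simulated failure rate are all hypothetical; a real environment would be far larger, but the seed-plus-log idea carries over.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical environment: a key-value store plus a flaky simulated search API."""
    seed: int
    store: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # All simulated randomness flows through one seeded RNG so runs can be replayed.
        self.rng = random.Random(self.seed)

    def call(self, tool: str, **args):
        if tool == "write":
            self.store[args["key"]] = args["value"]
            result = {"ok": True}
        elif tool == "read":
            result = {"ok": args["key"] in self.store, "value": self.store.get(args["key"])}
        elif tool == "search":
            # Simulated API that fails roughly 30% of the time, reproducibly per seed.
            if self.rng.random() >= 0.3:
                result = {"ok": True, "hits": [args["query"].upper()]}
            else:
                result = {"ok": False, "error": "simulated timeout"}
        else:
            result = {"ok": False, "error": f"unknown tool: {tool}"}
        self.log.append((tool, args, result))
        return result


def replay(seed: int, recorded_calls: list) -> ToyEnvironment:
    """Re-run a recorded list of (tool, args) pairs against a fresh environment."""
    env = ToyEnvironment(seed=seed)
    for tool, args in recorded_calls:
        env.call(tool, **args)
    return env


calls = [
    ("write", {"key": "a", "value": 1}),
    ("search", {"query": "x"}),
    ("read", {"key": "a"}),
    ("search", {"query": "y"}),
]
first, second = replay(7, calls), replay(7, calls)
print(first.log == second.log)
```

Because the same seed and the same calls always produce the same log, a failing scenario can be reproduced exactly, which is what makes the environment tractable to debug.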

Second, they build a verifier that catches the eight failure modes above — not by checking for an end-state success/fail, but by inspecting the trajectory. A good verifier flags looping, recovery failure, schema drift, etc. as separate signals, not just “this run got 0 points.”
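A sketch of what "separate signals" might look like for three of the eight modes. The step format (`tool`, `args`, `ok`), the `tool_schemas` mapping of tool name to parameter names, and the `required_tools` list are assumptions made for the example, and the detection rules are deliberately crude.

```python
def verify(trajectory: list[dict], tool_schemas: dict[str, set], required_tools: list[str]) -> dict[str, bool]:
    """Inspect a trajectory and return one boolean per failure signal."""
    signals = {"schema_drift": False, "recovery_failure": False, "premature_termination": False}
    attempted = set()
    for i, step in enumerate(trajectory):
        attempted.add(step["tool"])
        # Schema drift: argument names no longer match the tool's current signature.
        # Simplification: every parameter is treated as required.
        if set(step["args"]) != tool_schemas.get(step["tool"], set()):
            signals["schema_drift"] = True
        # Recovery failure: a call errored and the same tool never succeeded afterwards.
        if not step["ok"]:
            recovered = any(
                later["tool"] == step["tool"] and later["ok"] for later in trajectory[i + 1:]
            )
            if not recovered:
                signals["recovery_failure"] = True
    # Premature termination: the run ended without even attempting a required tool.
    if not set(required_tools) <= attempted:
        signals["premature_termination"] = True
    return signals


schemas = {"search": {"query"}, "write_file": {"path", "content"}}
required = ["search", "write_file"]

silent_failure = [
    {"tool": "search", "args": {"query": "x"}, "ok": True},
    {"tool": "write_file", "args": {"path": "a", "content": "b"}, "ok": False},
]
print(verify(silent_failure, schemas, required))
```

The point of the design is that the silent-failure run reports only `recovery_failure`, so a regression on that axis is distinguishable from one on any other.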

Third, they design a scenario library that hits the failure modes intentionally. The scenarios that surface looping are not the same scenarios that surface argument hallucination. You need each category to be represented, in proportion to how often it matters in production. This is where research taste becomes load-bearing.
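One simple way to get "in proportion to how often it matters" is to allocate a fixed scenario budget by production incident counts, with a floor so no category disappears. The sketch below uses largest-remainder rounding; the incident counts are made-up placeholders, not data from any lab.

```python
def allocate_scenarios(incidents: dict[str, int], budget: int, floor: int = 1) -> dict[str, int]:
    """Split `budget` scenarios across failure modes in proportion to incident counts."""
    if budget < floor * len(incidents):
        raise ValueError("budget too small to give every category its floor")
    total = sum(incidents.values())
    remaining = budget - floor * len(incidents)
    counts, remainders = {}, {}
    for mode, n in incidents.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        quota, remainders[mode] = divmod(remaining * n, total)
        counts[mode] = floor + quota
    leftover = budget - sum(counts.values())
    # Hand out the leftover scenarios to the largest fractional remainders.
    for mode in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[mode] += 1
    return counts


# Placeholder incident counts, for illustration only.
incidents = {"looping": 50, "recovery_failure": 30, "arg_hallucination": 20}
print(allocate_scenarios(incidents, budget=12))
```

Proportional allocation is only a starting point. Deciding that a rare failure mode deserves more coverage than its frequency suggests is exactly the kind of judgment call the paragraph above calls research taste.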

Fourth, they make the eval cheap enough to run on every PR. A trajectory-level eval is expensive (LLM calls per scenario × scenarios). If it’s too expensive, it runs nightly, which means the regression detection feedback loop is 24 hours, which means engineers learn to ignore it. Good agent-eval design treats cost-per-run as a first-class constraint.
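The arithmetic is worth writing down, because it decides whether the eval runs per PR or nightly. In the sketch below every number is a placeholder chosen to keep the math easy, not a real price or a real suite size.

```python
def run_cost_usd(scenarios: int, calls_per_scenario: int, tokens_per_call: int,
                 usd_per_million_tokens: float) -> float:
    """Cost of one full eval run: scenarios x LLM calls x tokens x price."""
    total_tokens = scenarios * calls_per_scenario * tokens_per_call
    return total_tokens * usd_per_million_tokens / 1_000_000


def max_scenarios_within(budget_usd: float, calls_per_scenario: int, tokens_per_call: int,
                         usd_per_million_tokens: float) -> int:
    """Largest scenario count whose run cost stays within the budget."""
    per_scenario = run_cost_usd(1, calls_per_scenario, tokens_per_call, usd_per_million_tokens)
    # Small epsilon guards against float division landing just below a whole number.
    return int(budget_usd / per_scenario + 1e-9)


# Placeholder figures: 200 scenarios, 20 calls each, 5,000 tokens per call, $5 per million tokens.
full_run = run_cost_usd(200, 20, 5000, 5.0)
per_pr_subset = max_scenarios_within(10.0, 20, 5000, 5.0)
print(f"full run: ${full_run:.2f}, scenarios affordable at $10 per PR: {per_pr_subset}")
```

Under these assumed numbers the full suite is too expensive to run on every PR, while a 20-scenario subset fits. That gap is why scenario selection and cost-per-run end up being designed together.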

Fifth, they ship the dashboard that makes the regressions readable. “The eval went down 3.2 points” is useless. “Eval went down 3.2 points; the drop is entirely on the recovery-failure axis, concentrated in scenarios involving the search-then-write tool sequence” is actionable. The dashboard is half the value.
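The per-axis breakdown behind a readable dashboard can be sketched as a diff of failure counts between two runs. The input format (one dict of boolean signals per scenario) is an assumption for the example, and the toy data is invented.

```python
from collections import Counter


def axis_failure_counts(results: list[dict[str, bool]]) -> Counter:
    """Count, per failure axis, how many scenarios triggered it."""
    counts = Counter()
    for signals in results:
        for axis, failed in signals.items():
            if failed:
                counts[axis] += 1
    return counts


def regression_by_axis(baseline: list[dict[str, bool]], candidate: list[dict[str, bool]]) -> dict[str, int]:
    """Return only the axes whose failure count changed, with the signed change."""
    base, cand = axis_failure_counts(baseline), axis_failure_counts(candidate)
    axes = sorted(set(base) | set(cand))
    return {axis: cand[axis] - base[axis] for axis in axes if cand[axis] != base[axis]}


# Invented results for three scenarios, before and after a change.
baseline = [
    {"looping": False, "recovery_failure": False},
    {"looping": True, "recovery_failure": False},
    {"looping": False, "recovery_failure": False},
]
candidate = [
    {"looping": False, "recovery_failure": True},
    {"looping": True, "recovery_failure": True},
    {"looping": False, "recovery_failure": False},
]
print(regression_by_axis(baseline, candidate))
```

Here the aggregate score would simply drop, while the breakdown says the whole drop sits on the recovery-failure axis. Grouping the same diff by tool sequence would give the second half of the actionable sentence above.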

// SAMPLE TRAJECTORY-LEVEL VERIFIER SCHEMA

scenario_id: 0142 # multi-tool with intentional error in step 2
trajectory_length: 7 steps
failure_signals:
  schema_drift: false
  arg_hallucination: false
  looping: false
  premature_termination: false
  recovery_failure: true # did not catch step-2 error
  plan_abandonment: false
  tool_choice_drift: false
  context_collapse: false
end_state: "task_unfinished_silent_failure"
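In code, a record like the sample above could be carried as a small typed structure that rejects unknown signal names. `VerifierRecord` and its field names mirror the sample schema, but the class itself is a hypothetical sketch, not a format any lab has confirmed using.

```python
from dataclasses import dataclass, field

FAILURE_MODES = (
    "schema_drift", "arg_hallucination", "looping", "premature_termination",
    "recovery_failure", "plan_abandonment", "tool_choice_drift", "context_collapse",
)


@dataclass
class VerifierRecord:
    """One verifier result per scenario run, with a boolean per failure mode."""
    scenario_id: str
    trajectory_length: int
    end_state: str
    failure_signals: dict[str, bool] = field(
        default_factory=lambda: dict.fromkeys(FAILURE_MODES, False)
    )

    def __post_init__(self):
        unknown = set(self.failure_signals) - set(FAILURE_MODES)
        if unknown:
            raise ValueError(f"unknown failure signals: {sorted(unknown)}")

    def triggered(self) -> list[str]:
        """Names of the failure modes this run tripped, in taxonomy order."""
        return [mode for mode in FAILURE_MODES if self.failure_signals.get(mode, False)]


record = VerifierRecord("0142", 7, "task_unfinished_silent_failure")
record.failure_signals["recovery_failure"] = True
print(record.triggered())
```

Keeping the scenario ID as a string preserves the leading zero, and validating signal names up front keeps a typo from silently creating a ninth axis on the dashboard.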

// 04 The market consequence

Because the skill stack takes time to build and because the roles are urgent, the comp profile for agent-eval research engineers has moved sharply. Senior IC offers in our dataset for agent / RL research (which subsumes most agent-eval roles) cleared +24% YoY in May 2026 — the fastest-rising research track in our data, and one of the two fastest-rising roles overall (the other being inference engineering, covered in last week’s piece).

The premium is not just compensation. Engineers who land an agent-eval research role at a frontier lab in 2026 step into a level of internal visibility that’s hard to overstate. The lab cares about the work; the planning conversations that depend on the eval go through them; the dashboard they ship gets opened by senior leadership every morning. We have placed engineers into these roles who, six months in, are presenting to the lab’s executive team weekly because their eval is the metric the lab’s next release is measured against.

An agent-eval engineer at one of our partner labs told us last quarter: “I went from being the eighteenth person on a post-training team to being the only person at the lab who can answer the question the model release goes/no-goes on. It’s the same comp band. It’s not the same job.”

// FROM AN INTERVIEW · SENIOR IC · Q1 2026

// 05 What this means for engineers

The audience reading this is mostly going to fall into one of three groups. Different reads for each.

If you’re currently doing post-training or RLHF research, the agent-eval skill stack is closer than you think. The verifier-design work overlaps heavily with reward-model design (you’re building a graded signal over a trajectory either way). The scenario-library work overlaps heavily with eval design more generally. The translation cost from a strong post-training engineer to a strong agent-eval engineer is small, and the comp premium is real. Worth a serious look.

If you’re currently doing ML engineering or platform work, the simulation-environment side is where you have a real edge over the typical post-training engineer. The hardest part of agent eval, for most teams, is making the environment reliable, cheap, and reproducible. That’s mostly an engineering problem, and it’s the half of the work most labs are weakest at. If you’ve ever owned a distributed system with state and replays, you’re closer than you’d guess.

If you’re currently doing applied AI work with production agents, you’ve probably already encountered most of the failure modes in the taxonomy above — in the worst possible setting, on the receiving end of a customer complaint. Codifying what you already know about agent failure modes into a research-grade eval is a smaller jump than it sounds. We’ve seen several applied AI engineers in our network make this transition over the past year, and they tend to make strong agent-eval researchers because they’ve earned their failure-mode intuitions the hard way.

// FROM POST-TRAINING

Reuse your reward-modeling instincts; the verifier is a graded trajectory signal

The hard part is the simulator and the cost-per-run constraint. Find a partner with platform-engineering depth and you’re 70% of the way.

// FROM ML ENGINEERING

Your edge is the environment. Most teams underbuild the simulator side.

Ship one with stateful tools, replay, and seeding. The research half of the role becomes much easier to learn once that exists.

// FROM APPLIED AI

Codify the production failures you’ve already seen

The failure-mode taxonomy is mostly intuition you’ve already developed. Writing it down formally is the move.

// FROM EARLY-CAREER

Open-source an eval harness in this space

The bar for getting into agent-eval roles is a shipped artifact. One serious public harness is enough to clear the screening for many of the labs that are currently hiring.

If you’d rather not change track but want to make sure your existing work is calibrated against the current market, Cohire Copilot will tell you where you sit relative to the agent specialization and what specifically would close the gap.

Next week’s piece will cover the four mistakes engineers make most often on the 13-day project stage of our screening. Subscribe at the bottom of the blog page if you want it.

The OpenTalent Network

// EDITORIAL · RESEARCH
