In the last six months, every frontier lab we work with as a hiring partner has opened at least one senior IC role with the phrase “agent eval” somewhere in the description. Most have opened more than one. The roles are well-defined, well-funded, and (this is the surprising part) very hard to fill. The gap between demand and the pipeline of engineers who can actually do this work is roughly the widest we’ve seen for any technical role since we started tracking the AI market. This piece is about why.
The short answer: agent eval design is research, not engineering, and it sits at an awkward intersection of three skill sets that don’t usually live in the same person. The long answer is more interesting, and worth working through if your trajectory has anything to do with agents over the next year or two.
// 01 Why every lab needs this now
The shape of frontier AI work has shifted noticeably in 2026. A year ago the most consequential research at the labs we track was post-training — RLHF, DPO, preference data, IFEval. Today the most consequential research is agent work: tool-use, multi-step planning, long-horizon execution, reliable function-calling. Some of this is product-pull (Claude Code, agent-platform launches, browser agents), some is research-pull (the math suggests agents are how you get utility out of the next generation of models). Either way, the shift is real, and the labs are organized around it now.
The problem is that all of the agent-research work the labs are doing depends on something they don’t have enough of: evals that actually catch the failure modes that matter in production. Without them, a team can’t tell whether the new tool-use policy is better than the old one, whether the new planner generalizes, or whether last week’s release broke something subtle. Agent research without good agent eval is research without a feedback loop. The labs know this.
So they’re hiring. And they’re hitting the same problem that, frankly, we hit at the network when we tried to build our own agent-eval capability last year: the people who can do this well are rarer than the people who can do almost any other senior research role.
// 02 Why traditional model evals miss
The standard playbook from the post-training world doesn’t transfer cleanly. A traditional model eval looks like this: you have a frozen dataset of inputs, the model produces outputs, you score the outputs against ground truth, you get a number. The eval is reproducible, the math is clean, the rankings between models are stable enough to drive training decisions.
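To make the contrast concrete, here is a minimal sketch of that traditional pattern in Python. The names `frozen_eval`, `toy_model`, and `DATASET` are illustrative assumptions, not any lab's harness, and exact-match scoring stands in for whatever metric a real eval would use.

```python
from typing import Callable


def frozen_eval(model: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Score a model on a frozen dataset by exact match against ground truth."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for prompt, expected in dataset if model(prompt).strip() == expected)
    return correct / len(dataset)


# Hypothetical stand-in for a model: a lookup table with one wrong answer.
def toy_model(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
    return answers.get(prompt, "")


# Frozen inputs paired with ground-truth outputs (illustrative only).
DATASET = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

score = frozen_eval(toy_model, DATASET)
print(f"accuracy: {score:.2f}")
```

Everything the eval needs fits in one function: fixed inputs, one output per input, one number at the end. That is the property agent evals lose.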
Agents break every part of that pattern. The inputs aren’t frozen — they include tool definitions, environment state, conversation history, and a planner that can revise its own context. The outputs aren’t single-shot — they’re sequences of (sometimes hundreds of) tool calls, branches, retries, recoveries. There is no clean ground truth for “did the agent solve this task” because the task was open-ended and the trajectory through which the agent solved it (or failed) is itself part of what you care about.
And the failure modes are different. The taxonomy that agent-eval engineers actually use to organize their work doesn’t look anything like the eval landscape from the post-training world. Eight failure modes show up over and over in our conversations with eval engineers at frontier labs:
The eight production agent failure modes
Schema drift
The tool’s signature changed; the agent’s call no longer parses. Hard to catch with frozen evals because the schemas evolve.
Argument hallucination
The agent fabricates plausible-looking but incorrect arguments — a fake user_id, a date that doesn’t exist.
Looping
The agent calls the same tool repeatedly with small variations, never making progress. Costs money and patience.
Premature termination
The agent declares success after one tool call when the user’s actual request required three.
Recovery failure
The first tool call errors. The agent doesn’t notice, or notices and gives up.
Plan abandonment
Mid-execution, the agent forgets the original goal and pursues a sub-goal that became salient in context.
Tool-choice drift
The agent picks the wrong tool for the request — usually a more general tool when a specific one is available.
Context collapse
By step 30, the context window is full of prior tool outputs and the agent has lost track of the original user intent.
None of these are the kind of thing you catch with a frozen-dataset eval. They are all trajectory-level failures. To catch them in an eval, you need an eval that can simulate trajectories, which means you need a simulator, which means you need to maintain that simulator’s tools, which means the eval starts looking a lot more like a small distributed system than a static benchmark. That’s the work.
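As a rough illustration of what "trajectory-level" means in practice, the sketch below checks two of the eight modes, looping and schema drift, over a recorded list of tool calls. The `ToolCall` shape, the `SCHEMAS` table, and the repeat threshold are all assumptions made for the example; a production detector would need fuzzier matching than this.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of an agent trajectory (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


def longest_same_tool_run(trajectory: list[ToolCall]) -> int:
    """Length of the longest run of consecutive calls to the same tool."""
    longest = current = 0
    previous = None
    for call in trajectory:
        current = current + 1 if call.tool == previous else 1
        longest = max(longest, current)
        previous = call.tool
    return longest


def is_looping(trajectory: list[ToolCall], max_repeats: int = 3) -> bool:
    """Flag a trajectory that calls one tool more than max_repeats times in a row."""
    return longest_same_tool_run(trajectory) > max_repeats


def has_schema_drift(call: ToolCall, schemas: dict[str, set[str]]) -> bool:
    """Flag a call whose tool is unknown or whose argument names differ from the schema."""
    expected = schemas.get(call.tool)
    return expected is None or set(call.args) != expected


# Assumed current tool schemas: tool name -> expected argument names.
SCHEMAS = {"search": {"query"}, "write_file": {"path", "content"}}

trajectory = [
    ToolCall("search", {"query": "q3 revenue"}),
    ToolCall("search", {"query": "q3 revenue 2025"}),
    ToolCall("search", {"query": "revenue q3"}),
    ToolCall("search", {"query": "q3 revenues"}),
    ToolCall("write_file", {"filename": "out.txt", "content": "..."}),  # stale argument name
]

looping = is_looping(trajectory)
drifted = [call.tool for call in trajectory if has_schema_drift(call, SCHEMAS)]
print(looping, drifted)
```

Neither check looks at the final answer. Both read the sequence of calls, which is exactly the information a frozen-dataset eval throws away.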
// 03 What good agent-eval design looks like
The strong engineers we’ve placed into agent-eval roles do roughly the same five things in roughly the same order. None of them is impossible. Together they’re a coherent skill stack that takes most engineers six to nine months of focused work to develop, and most engineers don’t get the opportunity to do that focused work unless their current role specifically asks for it.
First, they build a real environment. Not a frozen test set — an environment with stateful tools, simulated APIs, and a way to seed and replay scenarios. This is more engineering than research, and it’s where most early agent-eval efforts stall: the team can build the agent but not a tractable place to test it.
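A minimal sketch of the first step, assuming a toy ticketing domain: the environment holds state, is seeded for reproducibility, logs every call, and can be rebuilt by replaying that log. `TicketEnv`, its two tools, and `replay` are hypothetical names invented for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TicketEnv:
    """Toy stateful environment: seeded initial state plus a call log for replay."""
    seed: int
    tickets: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # The same seed always produces the same starting tickets.
        rng = random.Random(self.seed)
        for ticket_id in range(3):
            self.tickets[ticket_id] = {
                "status": "open",
                "priority": rng.choice(["low", "high"]),
            }

    def call(self, tool: str, **args) -> dict:
        """Execute a simulated tool call; errors come back as data, not exceptions."""
        self.log.append((tool, args))
        ticket = self.tickets.get(args.get("ticket_id"))
        if tool not in ("get_ticket", "close_ticket"):
            return {"error": "unknown_tool"}
        if ticket is None:
            return {"error": "not_found"}
        if tool == "close_ticket":
            ticket["status"] = "closed"
            return {"ok": True}
        return dict(ticket)


def replay(seed: int, log: list) -> TicketEnv:
    """Rebuild an environment from its seed and re-apply a recorded call log."""
    env = TicketEnv(seed)
    for tool, args in log:
        env.call(tool, **args)
    return env


env = TicketEnv(seed=7)
env.call("close_ticket", ticket_id=1)
missing = env.call("get_ticket", ticket_id=99)  # simulated tool error
replayed = replay(7, env.log)
```

Returning errors as values is a deliberate choice in this sketch: it lets a scenario inject a failing call and then observe whether the agent notices.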
Second, they build a verifier that catches the eight failure modes above — not by checking for an end-state success/fail, but by inspecting the trajectory. A good verifier flags looping, recovery failure, schema drift, etc. as separate signals, not just “this run got 0 points.”
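One way to picture such a verifier, as a sketch under simplifying assumptions: each step records only the tool name and whether the call succeeded, and three of the eight signals are computed independently. The `Step` type, the `verify` function, and the heuristics inside it are illustrative, not a description of any lab's verifier.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded tool call: which tool ran and whether it succeeded."""
    tool: str
    ok: bool = True


def verify(trajectory: list[Step], required_tools: set[str], max_repeats: int = 3) -> dict:
    """Inspect a trajectory and report each failure mode as its own signal."""
    # Looping: the same tool called more than max_repeats times in a row.
    longest = run = 0
    previous = None
    for step in trajectory:
        run = run + 1 if step.tool == previous else 1
        longest = max(longest, run)
        previous = step.tool

    # Recovery failure: a call errored and was never retried successfully.
    recovery_failure = any(
        not step.ok
        and not any(later.tool == step.tool and later.ok for later in trajectory[i + 1:])
        for i, step in enumerate(trajectory)
    )

    # Premature termination: the run ended before every required tool was attempted.
    attempted = {step.tool for step in trajectory}

    return {
        "trajectory_length": len(trajectory),
        "failure_signals": {
            "looping": longest > max_repeats,
            "premature_termination": not (required_tools <= attempted),
            "recovery_failure": recovery_failure,
        },
    }


run_report = verify(
    [Step("search"), Step("fetch_page", ok=False), Step("write_report")],
    required_tools={"search", "fetch_page", "write_report"},
)
print(run_report)
```

In this example the agent attempts every required tool and never loops, so an end-state check could easily score it as a pass. Only the recovery-failure signal shows that the second call errored and was never retried.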
Third, they design a scenario library that hits the failure modes intentionally. The scenarios that surface looping are not the same scenarios that surface argument hallucination. You need each category to be represented, in proportion to how often it matters in production. This is where research taste becomes load-bearing.
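The "in proportion" part can be sketched as a small allocation problem: given a scenario budget and a weight per failure mode, give each mode a guaranteed minimum and split the rest proportionally. The weights below are placeholder numbers invented for the example, not measurements, and `allocate_scenarios` is a hypothetical helper.

```python
def allocate_scenarios(weights: dict[str, int], budget: int, minimum: int = 1) -> dict[str, int]:
    """Split a scenario budget across failure modes in proportion to integer weights.

    Every mode gets at least `minimum` scenarios; leftovers go to the modes
    with the largest remainders (largest-remainder method).
    """
    if budget < minimum * len(weights):
        raise ValueError("budget too small to cover the minimum per category")
    total = sum(weights.values())
    remaining = budget - minimum * len(weights)
    counts = {mode: minimum + remaining * w // total for mode, w in weights.items()}
    leftover = budget - sum(counts.values())
    by_remainder = sorted(weights, key=lambda mode: remaining * weights[mode] % total, reverse=True)
    for mode in by_remainder[:leftover]:
        counts[mode] += 1
    return counts


# Placeholder weights (for example, relative incident counts). Purely illustrative.
PRODUCTION_WEIGHTS = {
    "schema_drift": 10,
    "arg_hallucination": 15,
    "looping": 20,
    "premature_termination": 10,
    "recovery_failure": 30,
    "plan_abandonment": 5,
    "tool_choice_drift": 5,
    "context_collapse": 5,
}

plan = allocate_scenarios(PRODUCTION_WEIGHTS, budget=40)
print(plan)
```

The minimum matters as much as the proportions: a rare mode with zero scenarios is a mode the eval cannot regress on. Choosing the weights themselves is the judgment call; the arithmetic is the easy part.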
Fourth, they make the eval cheap enough to run on every PR. A trajectory-level eval is expensive (LLM calls per scenario × scenarios). If it’s too expensive, it runs nightly, which means the regression detection feedback loop is 24 hours, which means engineers learn to ignore it. Good agent-eval design treats cost-per-run as a first-class constraint.
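The cost constraint is simple arithmetic, and writing it down makes the trade-off visible. The sketch below multiplies scenarios by LLM calls per scenario, adds an assumed token count per call and an assumed price, and then asks how many scenarios fit in a per-PR budget. Every number here is hypothetical.

```python
def run_cost_usd(scenarios: int, calls_per_scenario: int,
                 tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Estimated cost of one eval run: scenarios x calls x tokens x price."""
    total_tokens = scenarios * calls_per_scenario * tokens_per_call
    return total_tokens * usd_per_million_tokens / 1_000_000


def max_scenarios_within(budget_usd: float, calls_per_scenario: int,
                         tokens_per_call: int, usd_per_million_tokens: float) -> int:
    """Largest number of scenarios whose estimated cost stays within the budget."""
    per_scenario = run_cost_usd(1, calls_per_scenario, tokens_per_call, usd_per_million_tokens)
    return int(budget_usd // per_scenario)


# Hypothetical figures: 400 scenarios, 20 calls each, 3,000 tokens per call, $5 per million tokens.
full_suite = run_cost_usd(400, 20, 3_000, 5.0)
per_pr_subset = max_scenarios_within(20.0, 20, 3_000, 5.0)
print(f"full suite: ${full_suite:.2f}, scenarios affordable per PR: {per_pr_subset}")
```

Under these made-up figures the full suite is too expensive to run on every PR but a subset is not, which is the usual shape of the compromise: a small per-PR slice chosen to cover each failure mode, with the full library on a slower cadence.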
Fifth, they ship the dashboard that makes the regressions readable. “The eval went down 3.2 points” is useless. “Eval went down 3.2 points; the drop is entirely on the recovery-failure axis, concentrated in scenarios involving the search-then-write tool sequence” is actionable. The dashboard is half the value.
trajectory_length: 7 steps
failure_signals:
  schema_drift: false
  arg_hallucination: false
  looping: false
  premature_termination: false
  recovery_failure: true # did not catch step-2 error
  plan_abandonment: false
  tool_choice_drift: false
  context_collapse: false
end_state: "task_unfinished_silent_failure"
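A readable regression report starts from per-run records like the one above. As a minimal sketch, assuming each run is reduced to a dictionary of failure signals, the code below computes the failure rate per axis for a baseline and a candidate and lists the axes that got worse. The function names and the 5-point threshold are assumptions for illustration.

```python
from collections import Counter


def failure_rates(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of runs in which each failure signal fired."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter()
    for signals in runs:
        for axis, failed in signals.items():
            counts[axis] += int(failed)  # adding 0 still registers the axis
    return {axis: counts[axis] / len(runs) for axis in counts}


def regressions(baseline: list[dict[str, bool]], candidate: list[dict[str, bool]],
                threshold: float = 0.05) -> list[tuple[str, float]]:
    """Axes whose failure rate rose by more than threshold, worst first."""
    base, cand = failure_rates(baseline), failure_rates(candidate)
    deltas = {axis: cand[axis] - base.get(axis, 0.0) for axis in cand}
    worse = [(axis, delta) for axis, delta in deltas.items() if delta > threshold]
    return sorted(worse, key=lambda item: item[1], reverse=True)


clean = {"looping": False, "recovery_failure": False}
broken = {"looping": False, "recovery_failure": True}

baseline_runs = [clean, clean, clean, clean]
candidate_runs = [clean, broken, clean, broken]

report = regressions(baseline_runs, candidate_runs)
for axis, delta in report:
    print(f"{axis}: failure rate up {delta:.0%}")
```

Grouping the same deltas by scenario tag (for instance, by tool sequence) would be the natural next cut, and it is what turns "the eval went down" into a pointer at a specific behavior.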
// 04 The market consequence
Because the skill stack takes time to build and because the roles are urgent, the comp profile for agent-eval research engineers has moved sharply. Senior IC offers in our dataset for agent / RL research (which subsumes most agent-eval roles) rose more than 24% YoY in May 2026. That makes it the fastest-rising research track in our data, and one of the two fastest-rising roles overall (the other being inference engineering, covered in last week’s piece).
The premium is not just compensation. Engineers who land an agent-eval research role at a frontier lab in 2026 gain a level of internal visibility that’s hard to overstate. The lab cares about the work; the planning conversations that depend on the eval go through them; the dashboard they ship gets opened by senior leadership every morning. We have placed engineers into these roles who, six months in, are presenting to the lab’s executive team weekly because their eval is the metric the lab’s next release is judged against.
An agent-eval engineer at one of our partner labs told us last quarter: “I went from being the eighteenth person on a post-training team to being the only person at the lab who can answer the question the model release goes/no-goes on. It’s the same comp band. It’s not the same job.”
// FROM AN INTERVIEW · SENIOR IC · Q1 2026
// 05 What this means for engineers
The audience reading this is mostly going to fall into one of three groups. Different reads for each.
If you’re currently doing post-training or RLHF research, the agent-eval skill stack is closer than you think. The verifier-design work overlaps heavily with reward-model design (you’re building a graded signal over a trajectory either way). The scenario-library work overlaps heavily with eval design more generally. The translation cost from a strong post-training engineer to a strong agent-eval engineer is small, and the comp premium is real. Worth a serious look.
If you’re currently doing ML engineering or platform work, the simulation-environment side is where you have a real edge over the typical post-training engineer. The hardest part of agent eval, for most teams, is making the environment reliable, cheap, and reproducible. That’s mostly an engineering problem, and it’s the half of the work most labs are weakest at. If you’ve ever owned a distributed system with state and replays, you’re closer than you’d guess.
If you’re currently doing applied AI work with production agents, you’ve probably already encountered most of the failure modes in the taxonomy above — in the worst possible setting, on the receiving end of a customer complaint. Codifying what you already know about agent failure modes into a research-grade eval is a smaller jump than it sounds. We’ve seen several applied AI engineers in our network make this transition over the past year, and they tend to make strong agent-eval researchers because they’ve earned their failure-mode intuitions the hard way.
Reuse your reward-modeling instincts; the verifier is a graded trajectory signal
The hard part is the simulator and the cost-per-run constraint. Find a partner with platform-engineering depth and you’re 70% of the way there.
Your edge is the environment. Most teams underbuild the simulator side.
Ship one with stateful tools, replay, and seeding. The research half of the role becomes much easier to learn once that exists.
Codify the production failures you’ve already seen
The failure-mode taxonomy is mostly intuition you’ve already developed. Writing it down formally is the move.
Open-source an eval harness in this space
The bar for getting into agent-eval roles is a shipped artifact. One serious public harness is enough to clear the screening for many of the labs that are currently hiring.
If you’d rather not change track but want to make sure your existing work is calibrated against the current market, Cohire Copilot will tell you where you sit relative to the agent specialization and what specifically would close the gap.
Next week’s piece will cover the four mistakes engineers make most often on the 13-day project stage of our screening. Subscribe at the bottom of the blog page if you want it.