The seven failure modes of production tool-using agents

Production tool-using agents fail in a small number of recognizable ways. We’ve cataloged the seven that show up most often in our conversations with engineers shipping agents at frontier labs and AI-native companies, and we use this taxonomy when we screen for agent roles — partly because it organizes the conversation, and partly because the candidates who can name and detect these failures are usually the ones who have actually shipped agents to real users. If you’re applying to an agent-engineering role this quarter, you should be able to talk about all seven cold.

A note on framing: a few weeks ago we wrote about agent-eval design as the new bottleneck role, where we covered eight failure modes from the perspective of the engineer building trajectory-level evals. This piece is the practitioner-side companion — the failures named by the engineers who’ve encountered them in production, with the mitigation patterns that have actually worked. There’s overlap, but the framing is different. The eval engineer wants to detect; the practitioner wants to prevent.

// THE TAXONOMY · AT A GLANCE

Schema drift (~18%): Tool signature changed; the call no longer parses.

Argument hallucination (~22%): Agent fabricates plausible but wrong arguments.

Looping (~14%): Same tool called repeatedly with tiny variations.

Premature termination (~11%): Agent declares success after one step; user wanted three.

Recovery failure (~13%): A tool errored. Agent didn’t notice or didn’t try again.

Plan abandonment (~9%): Agent forgets the original goal mid-execution.

Context collapse (~13%): Context window fills; the user’s intent is lost.

Percentages are share of production agent incidents we’ve seen catalogued across the labs in our network during Q1 2026 — they sum to 100% by construction (we coded each incident to one primary mode). Take them as rough shape, not precision data.

// 01 Schema drift

// Failure 01 · ~18% OF INCIDENTS

The tool signature changed. The agent's call no longer parses.

The tool team shipped a small change to the function signature — added a required field, renamed a parameter, tightened a type. The agent’s model was trained or prompted against the old schema. The call goes out malformed. The error often surfaces as a confusing 4xx from the tool service, hours after the deploy that broke it.

// WHAT THE AGENT EMITTED · POST-DEPLOY
tool: "search_orders"
args: { user: "alice@x.com", range: "7d" }
# tool now requires user_id (not user) and date_from/date_to
→ 400 invalid_argument: missing required field "user_id"

// MITIGATION

Treat tool schemas as a versioned contract; pin agent prompts to a schema version; emit a CI failure when the tool service ships a non-backward-compatible change without bumping the schema version. The teams that have a schema-version registry report schema drift incidents down ~70% YoY.
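A minimal sketch of that CI check, assuming tool schemas are stored as plain dictionaries with hypothetical `version`, `required`, and `properties` keys (loosely modeled on JSON Schema). The function names are illustrative and do not come from any particular registry.

```python
from typing import Any

Schema = dict[str, Any]  # assumed shape: {"version", "required", "properties"}


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List the ways `new` would reject calls that were valid under `old`."""
    problems: list[str] = []
    old_props, new_props = old["properties"], new["properties"]

    # A newly required field breaks every caller that does not send it yet.
    for name in new["required"]:
        if name not in old["required"]:
            problems.append(f"new required field: {name}")

    # A removed or renamed parameter breaks callers that still send the old name.
    # Any type change is treated as breaking, which is deliberately conservative.
    for name in old_props:
        if name not in new_props:
            problems.append(f"removed or renamed field: {name}")
        elif old_props[name]["type"] != new_props[name]["type"]:
            problems.append(f"type changed for field: {name}")

    return problems


def major(version: str) -> int:
    """Major component of a 'MAJOR.MINOR.PATCH' version string."""
    return int(version.split(".")[0])


def check_schema_change(old: Schema, new: Schema) -> list[str]:
    """Return CI errors: breaking changes shipped without a major-version bump."""
    problems = breaking_changes(old, new)
    if problems and major(new["version"]) <= major(old["version"]):
        return problems
    return []


old_schema = {
    "version": "1.4.0",
    "required": ["user", "range"],
    "properties": {"user": {"type": "string"}, "range": {"type": "string"}},
}
new_schema = {
    "version": "1.5.0",  # minor bump only, but the change is breaking
    "required": ["user_id", "date_from", "date_to"],
    "properties": {
        "user_id": {"type": "string"},
        "date_from": {"type": "string"},
        "date_to": {"type": "string"},
    },
}

errors = check_schema_change(old_schema, new_schema)
for line in errors:
    print("SCHEMA CHECK FAILED:", line)
```

A CI job would exit non-zero whenever `errors` is non-empty. The same comparison can run at agent start-up to confirm that the schema version the prompt was pinned to is still the one the tool service is serving.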

// 02 Argument hallucination

// Failure 02 · ~22% OF INCIDENTS

The agent fabricates plausible-looking but invalid arguments.

The most common failure in production. The agent’s call parses and the schema is right, but the values are made up. A fictitious order_id. A date that doesn’t exist. A user_id with the right format but no corresponding user. The agent has hallucinated values it had no way to know. Often this happens when the agent is asked to do something it wasn’t given the inputs to do, and rather than asking, it confabulates.

// EXAMPLE TRACE
user: "refund the latest order from Priya"
agent_call: refund_order(order_id="ORD-829417")
# agent never called list_orders for Priya; ORD-829417 doesn't exist
→ 404 order_not_found

// MITIGATION

Constrained decoding against a verified enum of valid IDs when the tool requires references to existing entities. Where that's not possible, a "first-call-must-be-lookup" planner constraint reduces this failure mode dramatically, typically by 60–70% in our partner-lab data. Free-text reasoning prompts to "think step by step" do not help here. They make it worse.
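One way to approximate the lookup-first constraint at runtime is a guard that only lets an ID through if an earlier lookup actually returned it. This is a sketch, not constrained decoding: the `ReferenceGuard` class and its `id_args` mapping are hypothetical names introduced here, and the tool names are borrowed from the traces in this post.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReferenceGuard:
    """Rejects tool calls whose ID argument never appeared in an earlier lookup."""

    # Hypothetical config: tool name -> argument that must reference a known entity.
    id_args: dict[str, str]
    seen_ids: set[str] = field(default_factory=set)

    def record_lookup(self, ids: list[str]) -> None:
        """Remember IDs returned by a lookup tool such as list_orders."""
        self.seen_ids.update(ids)

    def check(self, tool: str, args: dict[str, str]) -> Optional[str]:
        """Return an error message for the agent, or None if the call may proceed."""
        arg_name = self.id_args.get(tool)
        if arg_name is None:
            return None  # this tool takes no entity reference
        value = args.get(arg_name)
        if value not in self.seen_ids:
            return (
                f"{tool}: {arg_name}={value!r} was not returned by any earlier "
                "lookup. Call a lookup tool first, or ask the user."
            )
        return None


guard = ReferenceGuard(id_args={"refund_order": "order_id"})

# The agent tries to refund an order it never looked up: blocked.
blocked = guard.check("refund_order", {"order_id": "ORD-829417"})

# After a lookup returns real IDs, a call that references one of them passes.
guard.record_lookup(["ORD-901", "ORD-882"])
allowed = guard.check("refund_order", {"order_id": "ORD-901"})
```

Returning the message to the agent as a tool error, rather than raising, gives it a chance to run the lookup itself instead of failing the whole trajectory.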

// 03 Looping

// Failure 03 · ~14% OF INCIDENTS

The agent calls the same tool repeatedly with tiny variations.

The agent’s first call fails or returns nothing useful. The agent’s second call is almost identical, with a slightly different parameter — a different date format, a slightly different query string. The third call is again almost identical. The agent has confused “try again” with “make progress.” If you don’t cap the loop, you discover it on the invoice.

// LOOP DETECTED
step_1: search(q="priya order 2026-03-12") → 0 results
step_2: search(q="priya order march 12 2026") → 0 results
step_3: search(q="priya 12 march 2026") → 0 results
# ... step_14, $0.42 spent so far

// MITIGATION

Per-tool call budgets, a semantic similarity check against the last N calls (not just exact equality), and an explicit "you are looping: try a different tool or ask the user" instruction triggered when N consecutive calls produce no new information. The semantic similarity check is the part that's usually missing; the exact-equality check is too lenient.
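A rough sketch of such a detector follows. Token-set Jaccard overlap stands in for a real semantic similarity measure (an embedding comparison would be the more likely production choice), and the class name, thresholds, and budget are all assumptions made for illustration.

```python
import re
from collections import Counter, deque


def tokens(text: str) -> frozenset:
    """Lowercased alphanumeric tokens, so '2026-03-12' and '12 2026' overlap."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of tokens the two sets have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class LoopDetector:
    def __init__(self, window=5, threshold=0.6, max_repeats=2, budget=10):
        self.recent = deque(maxlen=window)  # last N (tool, tokens) pairs
        self.calls = Counter()              # per-tool call counts
        self.threshold = threshold
        self.max_repeats = max_repeats
        self.budget = budget
        self.repeats = 0                    # consecutive unproductive near-duplicates

    def observe(self, tool: str, query: str, got_new_info: bool) -> bool:
        """Record one call; return True when the agent should be told it is looping."""
        self.calls[tool] += 1
        current = tokens(query)
        near_duplicate = any(
            prev_tool == tool and jaccard(current, prev) >= self.threshold
            for prev_tool, prev in self.recent
        )
        self.recent.append((tool, current))
        if got_new_info or not near_duplicate:
            self.repeats = 0
        else:
            self.repeats += 1
        return self.repeats >= self.max_repeats or self.calls[tool] > self.budget


detector = LoopDetector()
queries = [
    "priya order 2026-03-12",
    "priya order march 12 2026",
    "priya 12 march 2026",
]
flags = [detector.observe("search", q, got_new_info=False) for q in queries]
print(flags)  # [False, False, True]
```

An exact-equality check would never fire on these three queries, since no two strings match. The overlap check flags the third call, at which point the orchestrator can inject the looping instruction.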

// 04 Premature termination

// Failure 04 · ~11% OF INCIDENTS

The agent declares success after one tool call when the user wanted three.

User asked the agent to “find Priya’s last order and refund it.” Agent ran list_orders, got a list, said “Here are Priya’s orders.” The agent never ran the refund. From the agent’s perspective, it answered the question. From the user’s perspective, the agent didn’t do the work.

// EXAMPLE TRACE
user: "find priya's last order and refund it"
step_1: list_orders(user="priya@x.com") → [ORD-901, ORD-882, ...]
agent_reply: "Found 14 orders for Priya. The most recent is ORD-901."
# refund never executed; user has to ask again

// MITIGATION

A planner that decomposes the user request into a list of sub-goals on the first turn, and explicitly tracks unfinished sub-goals on every subsequent turn. Adding a "did you complete every step the user asked for" verification before the final reply catches roughly half of these in the systems we've seen.
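The tracking half of that pattern can be sketched in a few lines. The decomposition itself would normally come from the model; here the sub-goals are written by hand, and `SubGoalTracker` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubGoalTracker:
    """Holds the sub-goals from the first turn and gates the final reply on them."""

    goals: list[str]
    done: set[str] = field(default_factory=set)

    def complete(self, goal: str) -> None:
        """Mark a sub-goal as finished once its tool call has verifiably succeeded."""
        if goal not in self.goals:
            raise ValueError(f"unknown sub-goal: {goal}")
        self.done.add(goal)

    def unfinished(self) -> list[str]:
        return [g for g in self.goals if g not in self.done]

    def gate_final_reply(self, draft: str) -> str:
        """Return the draft if every sub-goal is done, else an instruction to continue."""
        remaining = self.unfinished()
        if not remaining:
            return draft
        return "Not finished. Remaining sub-goals: " + "; ".join(remaining)


tracker = SubGoalTracker(goals=["find Priya's last order", "refund that order"])
tracker.complete("find Priya's last order")

# The agent tries to stop after the lookup: the gate sends it back to work.
blocked = tracker.gate_final_reply("Found 14 orders for Priya. The most recent is ORD-901.")

tracker.complete("refund that order")
released = tracker.gate_final_reply("Refunded ORD-901.")
```

The gate's output is meant to be fed back to the agent as the next instruction, not shown to the user.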

// 05 Recovery failure

// Failure 05 · ~13% OF INCIDENTS

A tool errored. The agent didn't notice — or noticed and gave up.

The first tool call returns a 500. The agent’s next reply summarises a successful outcome anyway, sometimes because the prompt didn’t teach it to read error bodies, sometimes because it skimmed the result for keywords and missed the error code. The agent never retries, never falls back, never tells the user. This is the failure mode that most often produces visible-on-Twitter agent fails.

// EXAMPLE TRACE
step_1: send_email(to="team@x.com", subj="update") → 500 smtp_timeout
agent_reply: "I've sent the email to the team."
# email never went; agent confidently lied

// MITIGATION

Treat tool errors as a first-class part of the planning loop, not as exceptions to be ignored. The simplest version is a hard rule: "if the most recent tool result contains an error field or non-2xx status, you must address it before continuing." Pair it with a small retry policy. We've seen recovery failures drop ~80% from this single change.
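A small sketch of the hard rule plus a retry policy, assuming tool results arrive as dictionaries with hypothetical `status` and `error` keys. The function names and the exponential backoff schedule are illustrative choices, not a prescribed design.

```python
import time
from typing import Callable


def is_error(result: dict) -> bool:
    """True if the result has an error field or a non-2xx (or missing) status."""
    return bool(result.get("error")) or not 200 <= result.get("status", 0) < 300


def call_with_retry(
    tool: Callable[[], dict], max_attempts: int = 3, base_delay: float = 0.5
) -> dict:
    """Call `tool`, retrying 5xx failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = tool()
        # Only server-side failures are retried; a 4xx will not fix itself.
        retryable = is_error(result) and result.get("status", 0) >= 500
        if not retryable or attempt == max_attempts:
            return result
        time.sleep(base_delay * 2 ** (attempt - 1))


def next_instruction(result: dict) -> str:
    """The hard rule: an unaddressed error blocks the agent from moving on."""
    if is_error(result):
        return (
            f"The last tool call failed (status {result.get('status')}: "
            f"{result.get('error')}). Address this before continuing: "
            "retry, fall back, or tell the user."
        )
    return "continue"


# A flaky tool that times out twice, then succeeds.
attempts = iter(
    [
        {"status": 500, "error": "smtp_timeout"},
        {"status": 500, "error": "smtp_timeout"},
        {"status": 200, "error": None},
    ]
)
recovered = call_with_retry(lambda: next(attempts), base_delay=0)

# A tool that never recovers: the agent is told to deal with it, not to report success.
still_down = call_with_retry(lambda: {"status": 500, "error": "smtp_timeout"}, base_delay=0)
instruction = next_instruction(still_down)
```

Automatic retries are only safe for idempotent tools. For something like `send_email`, a timeout does not prove the message was never sent, so a real policy might skip the retry and surface the error to the user instead.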

Five of the seven failure modes are reliably caught by the same pattern: a verifier that runs after every tool call and asks, “did the right thing actually happen?” The teams that ship this verifier ship more reliable agents. The teams that don’t, don’t.

// FROM A SR. AGENT ENGINEER · Q1 2026

// 06 Plan abandonment

// Failure 06 · ~9% OF INCIDENTS

The agent forgets the original goal mid-execution.

The user asked the agent to onboard a new customer (multi-step task). Mid-way through, a tool returns a related but distracting result — say, an unrelated open ticket from that customer. The agent pivots to address the ticket, completes it, and reports back as if the original task is done. The original onboarding goal has been quietly dropped. The agent’s working memory has been overwritten by the most salient recent thing.

// MITIGATION

An explicit "pinned goal" structure that the planner re-reads before every step, rather than relying on the conversation history alone. A single line of state, such as "Original goal: onboard customer X. Steps complete: 2/5.", pinned at the top of the agent's context drastically reduces this. The teams we work with that do this report plan-abandonment incidents below 3%.
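As a sketch, the pinned goal can be a small piece of state that is rendered fresh at the top of the prompt on every step, so it never depends on where the original request sits in the history. `PinnedGoal` and `context_for_step` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PinnedGoal:
    """State kept outside the conversation history and re-rendered every step."""

    goal: str
    total_steps: int
    steps_complete: int = 0

    def render(self) -> str:
        return (
            f"Original goal: {self.goal}. "
            f"Steps complete: {self.steps_complete}/{self.total_steps}."
        )


def context_for_step(pinned: PinnedGoal, history: list[str]) -> list[str]:
    """Build the prompt for the next step with the pinned goal always first."""
    return [pinned.render(), *history]


pinned = PinnedGoal(goal="onboard customer X", total_steps=5, steps_complete=2)
history = ["tool result: unrelated open ticket from customer X"]

context = context_for_step(pinned, history)
print(context[0])  # Original goal: onboard customer X. Steps complete: 2/5.
```

The counter only advances when the orchestrator records a completed onboarding step, so resolving the distracting ticket leaves the line at 2/5 and the agent cannot report the original task as done.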

// 07 Context collapse

// Failure 07 · ~13% OF INCIDENTS

Context window fills with tool outputs. The user's intent is lost.

By step 25, the context window is full of prior tool outputs — large search results, file contents, transcripts. The user’s original request, 24,000 tokens ago, has been pushed off the front. The agent now answers what’s most salient, not what was asked. This is structurally distinct from plan abandonment: in plan abandonment the agent has the goal in context but ignores it; in context collapse the goal isn’t in context anymore.

// MITIGATION

Pin a compact "user-intent" summary to the top of the context window, kept canonical across turns. Aggressively summarise old tool outputs once they're past their useful window: keep the result, drop the intermediate scaffolding. The systems that scale agent runs past 30 steps invariably do some version of this. Without it, the failure rate rises sharply with trajectory length.
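A minimal sketch of that compaction, assuming tool outputs are kept as structured records and not as raw history. The `truncate` function is a placeholder for whatever summarizer a real system would use (most likely a model call), and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolOutput:
    step: int
    tool: str
    text: str


def truncate(text: str, limit: int = 80) -> str:
    """Placeholder summarizer: keeps the head of the output and marks the cut."""
    return text if len(text) <= limit else text[:limit] + " [truncated]"


def assemble_context(
    user_intent: str,
    outputs: list[ToolOutput],
    keep_recent: int = 3,
    summarize: Callable[[str], str] = truncate,
) -> list[str]:
    """Pin the intent first, keep recent outputs verbatim, compact older ones."""
    lines = [f"USER INTENT (pinned): {user_intent}"]
    cutoff = len(outputs) - keep_recent  # outputs before this index are compacted
    for i, out in enumerate(outputs):
        body = summarize(out.text) if i < cutoff else out.text
        lines.append(f"[step {out.step}] {out.tool}: {body}")
    return lines


outputs = [ToolOutput(step=i, tool="search", text="x" * 500) for i in range(1, 6)]
context = assemble_context("refund Priya's latest order", outputs, keep_recent=2)

for line in context:
    print(len(line), line[:40])
```

Because the intent line is regenerated on every turn and never appended to the history, it cannot be pushed out of the window no matter how many tool outputs accumulate.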

// 08 What we tell candidates about all this

If you’re applying to a senior agent-engineering role at a frontier lab in 2026, the panel doesn’t expect you to have invented this taxonomy, but they do expect you to be conversant with most of it. The candidates who land these roles can usually walk a panelist through three or four of the seven from memory, name the production system where they encountered each one, and describe what they shipped to mitigate it. The candidates who can’t tend to come from teams where they were one step removed from the agent’s incident logs. The panel can tell.

If your day-to-day doesn’t put you next to production agent failures, the cheapest way to close the gap is to ship a small agent yourself, ideally one that calls real tools with non-trivial error surfaces. You will hit at least four of the seven failure modes in your first weekend. Four of seven is more than enough to make the panel conversation real.

The flip side, for hiring panels: when a candidate names these failures, follow up specifically on mitigation rather than on definition. Anyone who reads agent-design Twitter knows the names. The candidates who have actually shipped this work give you specific patterns — a constrained-decoding setup, a pinned-goal structure, a per-tool budget — not vocabulary. The taxonomy is on the test; the patterns are the test.

Next week’s piece is a less technical one: when quitting your frontier lab is the right call. It covers the signs frontier-lab engineers hit around year three, and what the ones who left and did better did differently. Subscribe at the bottom of the blog page if you’d like it in your inbox.

The OpenTalent Network

// EDITORIAL · AGENTS & TOOLS
