Error Recovery for Claude Agents: Budgets, Repeat Detection, and Stuck Signals

A Claude agent that works perfectly in a demo will, in production, occasionally make 217 tool calls in a single task. It will call the same tool with the same arguments four times in a row, get the same error, and keep going. It will burn through your daily token budget on a single user request because nothing told it to stop. This is the failure mode that kills agent products: not a single dramatic crash, but a quiet spiral where each step looks locally reasonable and the global outcome is catastrophic.

Four safeguards stop the spiral. A per-task tool budget with a hard cutoff. Repeat-call detection that forces a reflection step. Error classification that distinguishes transient failures from permanent ones. An explicit "I'm stuck" tool the agent can call to ask for human input. None of these are clever. All of them are missing from most agent implementations I have read.

This article instruments a working agent against a deliberately broken tool and measures what each safeguard buys you. The numbers are stark enough that you should never run an agent loop in production without all four.

The 200-call spiral, with numbers

I built a small test harness that wraps an Anthropic tool-use loop and points it at a fake API client. The fake client fails in three ways: a 30% rate of network timeouts (transient), a 10% rate of malformed JSON responses (looks like a parse error), and a 5% rate of 429 rate-limit errors (transient with backoff hint). The task is simple: fetch a user record, transform it, write the result.

Without any safeguards, here is what 50 runs of the same task produced:

Median tool calls: 12
95th-percentile tool calls: 89
Worst run: 217 tool calls before I killed the process manually
Task success rate: 64%
Median input tokens consumed: 18,400
95th-percentile input tokens: 142,000

With all four safeguards enabled, the same 50 runs:

Median tool calls: 8
95th-percentile tool calls: 14
Worst run: 22 calls (then escalated via the stuck signal)
Task success rate: 92%
Median input tokens: 11,200
95th-percentile input tokens: 19,800

The 95th-percentile token consumption dropped roughly 7x (142,000 to 19,800). The worst case dropped by an order of magnitude (217 calls to 22). Success rate jumped 28 points. None of this required a smarter model. The same Claude Sonnet 4.5 ran both sets.

Safeguard 1: a hard tool budget

The simplest safeguard is a counter. Every tool call increments it. When the counter hits a ceiling, the loop terminates and the agent receives a final synthetic message: "Tool budget exhausted. Summarize what you have learned and return a partial answer."

Set the budget per task class, not globally. A "read a file and summarize" task gets 5 calls. A "research a topic across 8 sources" task gets 40. The ceiling exists because Claude will sometimes decide that one more search is worth it, and one more, and one more. Without a counter the loop has no notion of when "enough is enough" beyond the model's own judgment, which is exactly what is failing in the spiral case.
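
One way to express per-task-class budgets is a lookup table with a conservative fallback. This is a minimal sketch: the class names and the default of 10 are illustrative assumptions, and only the 5 and 40 figures come from the examples above.

```python
# Hypothetical task-class names; only the 5 and 40 budgets come from the text.
TOOL_BUDGETS = {
    "summarize_file": 5,
    "multi_source_research": 40,
}
DEFAULT_TOOL_BUDGET = 10  # assumed fallback for tasks nobody has classified


def budget_for(task_class: str) -> int:
    """Return the tool-call ceiling for a task class."""
    return TOOL_BUDGETS.get(task_class, DEFAULT_TOOL_BUDGET)


print(budget_for("summarize_file"), budget_for("unclassified"))
```

The result would be passed as max_tool_calls when the loop is constructed, so an unclassified task still runs under a ceiling.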

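from dataclasses import dataclass, field

from anthropic import Anthropic

@dataclass
class BudgetedAgentLoop:
    client: Anthropic
    model: str = "claude-sonnet-4-5"
    max_tool_calls: int = 15
    tool_calls_used: int = 0
    input_tokens_used: int = 0
    messages: list = field(default_factory=list)

    def run(self, user_prompt: str, tools: list, tool_executor):
        self.messages = [{"role": "user", "content": user_prompt}]
        while True:
            # Check the budget before the next API call so the model still
            # gets one final turn to answer.
            if self.tool_calls_used >= self.max_tool_calls:
                self.messages.append({
                    "role": "user",
                    "content": (
                        f"Tool budget exhausted ({self.max_tool_calls} calls). "
                        "Summarize what you have learned and return a partial answer."
                    ),
                })
                # The history contains tool_use blocks, so keep sending the
                # tool definitions; tool_choice "none" forbids further calls.
                response = self.client.messages.create(
                    model=self.model,
                    max_tokens=2048,
                    tools=tools,
                    tool_choice={"type": "none"},
                    messages=self.messages,
                )
                self.input_tokens_used += response.usage.input_tokens
                return self._final_text(response)

            response = self.client.messages.create(
                model=self.model,
                max_tokens=2048,
                tools=tools,
                messages=self.messages,
            )
            self.input_tokens_used += response.usage.input_tokens
            # Only a tool request continues the loop; end_turn, max_tokens,
            # and any other stop reason end it.
            if response.stop_reason != "tool_use":
                return self._final_text(response)

            tool_results = self._execute_tools(response, tool_executor)
            self.messages.append({"role": "assistant", "content": response.content})
            self.messages.append({"role": "user", "content": tool_results})
            self.tool_calls_used += len(tool_results)

    def _final_text(self, response) -> str:
        # Concatenate the text blocks of the final response.
        return "".join(b.text for b in response.content if b.type == "text")

    def _execute_tools(self, response, tool_executor) -> list:
        # Assumes tool_executor(name, input) returns a string.
        return [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": tool_executor(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]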

Two design choices matter. First, the budget check happens before the next API call, not after. This means the model gets one final turn to produce a coherent answer rather than the loop dying mid-tool-use. Second, the synthetic "budget exhausted" message uses neutral language. Words like "failure" or "error" prime the model to apologize and refuse; "summarize what you have learned" prompts it to be useful with partial information.

Safeguard 2: repeat-call detection

Claude will occasionally call the exact same tool with the exact same arguments twice in a row. This almost always indicates the agent is stuck and has not noticed. The fix is detection plus a forced reflection step.

The detection is a fingerprint of (tool_name, sorted_arguments_json). Store the last two fingerprints. If the new call matches either, you have a repeat. When you detect one, do not pass the tool result back as normal. Instead, inject a reflection prompt:

import hashlib
import json

def fingerprint_tool_call(tool_use_block) -> str:
    payload = {
        "name": tool_use_block.name,
        "input": tool_use_block.input,
    }
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def detect_repeat(history: list[str], new_fp: str) -> bool:
    return new_fp in history[-2:]

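# Inside the loop, for each tool_use block in the assistant response
# (tool_results is the list being built for the next user turn):
new_fp = fingerprint_tool_call(tool_use)
if detect_repeat(self.fingerprint_history, new_fp):
    reflection = (
        "You just called the same tool with identical arguments. "
        "The result will be the same. "
        "Stop, reflect on why this is not working, and choose a different approach: "
        "different arguments, a different tool, or signal that you are stuck."
    )
    # The API expects a tool_result for every tool_use block, so return the
    # reflection as an error result rather than as a bare user message.
    tool_results.append({
        "type": "tool_result",
        "tool_use_id": tool_use.id,
        "content": reflection,
        "is_error": True,
    })
    continue  # Skip executing the duplicate call
self.fingerprint_history.append(new_fp)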

In the instrumented run, 31% of the spirals over 50 calls contained at least one repeat-pair. Catching them early and forcing a reflection prevented 9 out of 11 of those spirals from continuing past 20 calls.

A subtle point: the threshold is "same fingerprint in the last 2 calls," not "ever." Agents legitimately re-fetch the same file later in a task to confirm a change took effect. The two-call window catches the pathological case (an immediate repeat) without flagging legitimate re-reads.
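
To make the window behavior concrete, here is a small self-contained sketch that replays a call sequence through a two-slot window. The RepeatWindow class and the tool names are illustrative assumptions, not part of the loop above; the fingerprinting follows the same name-plus-sorted-JSON idea.

```python
import hashlib
import json
from collections import deque


def fingerprint(name: str, tool_input: dict) -> str:
    """Hash the tool name plus its arguments in a key-order-independent form."""
    canonical = json.dumps({"name": name, "input": tool_input}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


class RepeatWindow:
    """Remembers the fingerprints of the most recent executed calls."""

    def __init__(self, size: int = 2):
        self._recent = deque(maxlen=size)

    def is_repeat(self, name: str, tool_input: dict) -> bool:
        fp = fingerprint(name, tool_input)
        if fp in self._recent:
            return True  # skipped duplicates are not recorded again
        self._recent.append(fp)
        return False


window = RepeatWindow()
calls = [  # hypothetical tool names, for illustration only
    ("read_file", {"path": "a.txt"}),                # first read
    ("read_file", {"path": "a.txt"}),                # immediate repeat
    ("write_file", {"path": "a.txt", "text": "x"}),
    ("list_dir", {"path": "."}),
    ("read_file", {"path": "a.txt"}),                # later re-read to confirm
]
flags = [window.is_repeat(name, args) for name, args in calls]
print(flags)  # [False, True, False, False, False]
```

Only the immediate duplicate is flagged. By the time the file is read again, two other calls have pushed its fingerprint out of the window.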

Safeguard 3: transient vs permanent errors

Not every error deserves a retry. A 429 rate-limit with a Retry-After header is transient; backing off and retrying is correct. A 401 unauthorized is permanent; retrying is pointless and wastes budget. A 400 with a malformed-payload message is permanent; the agent needs to fix its arguments, not retry the same broken call.

The agent itself cannot reliably make this distinction from a raw error string. Your tool executor must classify the error before handing it back. The contract for a tool result becomes a small envelope:

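# RateLimitError, AuthError, ValidationError, NetworkTimeout, and do_call are
# placeholders for your own tool client's exceptions and dispatch function.
from dataclasses import dataclass

@dataclass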
class ToolResult:
    content: str
    error_class: str | None = None  # None, "transient", "permanent"
    retry_after_seconds: float | None = None

def execute_with_classification(tool_name: str, args: dict) -> ToolResult:
    try:
        return ToolResult(content=do_call(tool_name, args))
    except RateLimitError as e:
        return ToolResult(
            content=f"Rate limited. Retry after {e.retry_after}s.",
            error_class="transient",
            retry_after_seconds=e.retry_after,
        )
    except (AuthError, ValidationError) as e:
        return ToolResult(
            content=f"Permanent failure: {e}. Do not retry with the same arguments.",
            error_class="permanent",
        )
    except NetworkTimeout:
        return ToolResult(
            content="Network timeout. May succeed if retried.",
            error_class="transient",
        )

When you serialize a ToolResult into the message payload, include the classification in the visible text. Claude is excellent at acting on natural-language hints. "Do not retry with the same arguments" in the tool result text reduces same-argument retries by roughly 70% in my measurements, compared to a bare error string like HTTP 400: Bad Request.
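
A minimal sketch of that serialization step, assuming the envelope described above. The bracketed prefix format and the to_tool_result_block helper are my own illustrative choices; the tool_result block fields (type, tool_use_id, content, is_error) are the ones the Messages API uses for tool results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:  # mirrors the envelope described in the text
    content: str
    error_class: Optional[str] = None  # None, "transient", "permanent"
    retry_after_seconds: Optional[float] = None


def to_tool_result_block(tool_use_id: str, result: ToolResult) -> dict:
    """Build the tool_result content block sent back in the next user turn."""
    if result.error_class is None:
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": result.content,
        }
    # Put the classification in the text the model reads, not only in metadata.
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": f"[{result.error_class} error] {result.content}",
        "is_error": True,
    }


failed = ToolResult(
    content="Permanent failure: bad payload. Do not retry with the same arguments.",
    error_class="permanent",
)
print(to_tool_result_block("toolu_example", failed)["content"])
```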

Pair this with one more rule on the executor side: for transient errors, retry up to 3 times automatically before returning to the model, with exponential backoff (1s, 2s, 4s). The model should never see a flake-recoverable failure as a tool result. By the time the model sees a "transient" classification, the executor has already given up on quick recovery.
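
The executor-side retry rule could look like the sketch below. TransientError is a stand-in exception and call_with_backoff is a hypothetical helper; the sleep function is injectable so the behavior can be checked without waiting. A real executor would probably also honor a Retry-After hint when one is present rather than the fixed schedule.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Stand-in for timeouts and rate limits raised by a tool client."""


def call_with_backoff(
    call: Callable[[], str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run call(); on TransientError retry up to max_retries times.

    Delays double each time (1s, 2s, 4s with the defaults). The last
    failure is re-raised so the caller can classify it for the model.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Usage: a call that fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
delays = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


print(call_with_backoff(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```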

Safeguard 4: an explicit "stuck" tool

Even with budgets and repeat detection, an agent can plateau: not spiraling, but not progressing. The cleanest exit is to give the agent a tool it can call when it has tried twice and hit a wall.

stuck_tool = {
    "name": "request_human_help",
    "description": (
        "Call this when you have tried twice and cannot make progress. "
        "Provide a concise summary of what you tried and what is blocking. "
        "A human operator will review and respond. Calling this is preferable "
        "to making more guesses or repeated tool calls."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "attempted_approaches": {"type": "array", "items": {"type": "string"}},
            "specific_question": {"type": "string"},
        },
        "required": ["summary", "specific_question"],
    },
}

The tool description matters more than the schema. Phrases like "preferable to making more guesses" give the model permission to stop. Without that explicit permission, models tend to keep trying because the prompt implicitly demands an answer.

When this tool fires, the loop should immediately terminate and surface the structured payload to whatever queue or notification channel your operator uses. Do not feed the call result back into the model; the next step is a human, not another inference round.
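
A sketch of how the loop might detect the stuck signal and hand off, under stated assumptions: the Escalation dataclass, handle_turn, and the notify callback are hypothetical names, and the SimpleNamespace object stands in for an SDK tool_use content block with type, name, and input attributes.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, Optional

STUCK_TOOL_NAME = "request_human_help"


@dataclass
class Escalation:
    summary: str
    specific_question: str
    attempted_approaches: list


def find_escalation(content_blocks) -> Optional[Escalation]:
    """Return the escalation payload if the model called the stuck tool."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and block.name == STUCK_TOOL_NAME:
            data = block.input
            return Escalation(
                summary=data["summary"],
                specific_question=data["specific_question"],
                # Optional in the schema, so default to an empty list.
                attempted_approaches=data.get("attempted_approaches", []),
            )
    return None


def handle_turn(content_blocks, notify: Callable[[Escalation], None]) -> bool:
    """Return True when the loop should stop because a human is needed."""
    escalation = find_escalation(content_blocks)
    if escalation is None:
        return False
    notify(escalation)  # hypothetical hook: push to the operator's channel
    return True


# Stand-in for a tool_use block from a model response.
block = SimpleNamespace(
    type="tool_use",
    name=STUCK_TOOL_NAME,
    input={
        "summary": "Fetch keeps returning 401.",
        "specific_question": "Is the staging API key still valid?",
    },
)
inbox = []
stopped = handle_turn([block], inbox.append)
print(stopped, inbox[0].specific_question)
```

The loop would call handle_turn on each assistant response before executing any tools and return as soon as it reports True.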

A fault-injection test harness

You can build a measurement harness in roughly 100 lines. The shape:

import random

# NetworkTimeout and RateLimitError are the placeholder exceptions used by the
# classifier; substitute your tool client's own types.

class FlakyToolWrapper:
    def __init__(self, real_tool, timeout_rate=0.3, malformed_rate=0.1, rate_limit_rate=0.05):
        self.real_tool = real_tool
        self.timeout_rate = timeout_rate
        self.malformed_rate = malformed_rate
        self.rate_limit_rate = rate_limit_rate
        self.call_count = 0

    def call(self, args):
        self.call_count += 1
        r = random.random()
        if r < self.timeout_rate:
            raise NetworkTimeout("simulated timeout")
        if r < self.timeout_rate + self.rate_limit_rate:
            raise RateLimitError(retry_after=2.0)
        if r < self.timeout_rate + self.rate_limit_rate + self.malformed_rate:
            return "{not valid json"
        return self.real_tool(args)

def run_benchmark(safeguards_enabled: bool, runs: int = 50):
    results = []
    for _ in range(runs):
        loop = BudgetedAgentLoop(
            client=Anthropic(),
            max_tool_calls=15 if safeguards_enabled else 250,
        )
        tools = wrap_with_flakes(real_tools)
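        # The loop returns a partial answer when the budget runs out rather
        # than raising, so validate_result decides success in every case.
        result = loop.run(task_prompt, tools, tool_executor)
        results.append({
            "tool_calls": loop.tool_calls_used,
            "success": validate_result(result),
            # Assumes the loop accumulates response.usage.input_tokens here.
            "tokens_in": loop.input_tokens_used,
        })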
    return results

Run the harness with safeguards off and on, then plot the call-count distributions side by side. The "off" distribution has a long, fat tail (the spirals that run to 200+ calls). The "on" distribution clusters tightly. If your team is debating whether the safeguards are worth the engineering cost, the visual difference ends the debate in about 30 seconds.
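
Before plotting, the per-run records can be collapsed into headline numbers like the ones reported earlier. This is a sketch, assuming records shaped like the dictionaries run_benchmark appends; the summarize helper is hypothetical, and the "inclusive" percentile method is one reasonable choice rather than the method used for the figures in this article.

```python
import statistics


def summarize(results: list) -> dict:
    """Collapse per-run records into median, p95, worst case, and success rate."""
    calls = sorted(r["tool_calls"] for r in results)
    return {
        "median_calls": statistics.median(calls),
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        # "inclusive" interpolates within the observed range for small samples.
        "p95_calls": statistics.quantiles(calls, n=20, method="inclusive")[-1],
        "worst_calls": calls[-1],
        "success_rate": sum(bool(r["success"]) for r in results) / len(results),
    }


# Made-up records purely to exercise the function; not measured data.
sample = [
    {"tool_calls": 8, "success": True},
    {"tool_calls": 10, "success": True},
    {"tool_calls": 12, "success": True},
    {"tool_calls": 200, "success": False},
]
summary = summarize(sample)
print(summary["median_calls"], summary["worst_calls"], summary["success_rate"])
```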

Production wiring notes

A few details that matter once you take this past a benchmark:

Log every fingerprint, every budget decision, and every classification at INFO level with a task_id correlation key. When an operator asks "why did this run cost $14," you need the trace. The Anthropic tool-use overview (https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/overview) covers the request shape; instrumentation is on you.
Make the budget configurable per task type but enforce a global ceiling at the gateway layer too. A misconfigured downstream cannot bankrupt you if the orchestrator caps total calls per session at, say, 50.
The request_human_help payload should land in whatever channel the operator is already watching: a Slack webhook, a Telegram bot, a dashboard row. The wrong place is "the agent's logs"; nobody reads those.
When you add new tools, audit their error surfaces. A new tool that surfaces every failure as a generic RuntimeError defeats the classifier. The Anthropic SDK source (https://github.com/anthropics/anthropic-sdk-python) is worth reading to see how the client itself classifies its own errors.
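
The logging and global-ceiling notes above can be combined in one small component. The sketch below is an assumption about how an orchestrator might do it: SessionCeiling, SessionCeilingExceeded, and the log message format are hypothetical, and the default of 50 is the example figure from the text.

```python
import logging

logger = logging.getLogger("agent.harness")


class SessionCeilingExceeded(RuntimeError):
    """Raised when a session has used its total tool-call allowance."""


class SessionCeiling:
    """Orchestrator-level cap on tool calls across every task in a session."""

    def __init__(self, session_id: str, max_calls: int = 50):
        self.session_id = session_id
        self.max_calls = max_calls
        self.used = 0

    def charge(self, task_id: str, tool_name: str) -> None:
        """Record one tool call, or raise if the session is out of budget."""
        if self.used >= self.max_calls:
            logger.info(
                "ceiling_hit session=%s task_id=%s used=%d",
                self.session_id, task_id, self.used,
            )
            raise SessionCeilingExceeded(
                f"session {self.session_id} reached {self.max_calls} tool calls"
            )
        self.used += 1
        # task_id is the correlation key for reconstructing a run's cost later.
        logger.info(
            "tool_call session=%s task_id=%s tool=%s used=%d/%d",
            self.session_id, task_id, tool_name, self.used, self.max_calls,
        )


ceiling = SessionCeiling("session-1", max_calls=2)
ceiling.charge("task-a", "fetch_user")
ceiling.charge("task-b", "write_result")
print(ceiling.used)
```

Calling charge before every tool execution, regardless of which task-level loop issued it, keeps the cap independent of any per-task configuration.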

The throughline of all four safeguards: an agent left to its own judgment will, sometimes, judge wrong in ways that compound. Each safeguard is a place where the harness, not the model, gets the final word. That asymmetry is the entire point of the loop being yours to write rather than the SDK's to provide.
