MCP Tool Error Response vs Python Exception: When to Raise and When to Return

Two error paths exist in the MCP python-sdk. Picking the wrong one breaks the LLM's retry loop or crashes your session. Here's the rule, plus the pattern that survives SDK version drift.

The Model Context Protocol python-sdk gives you two ways to signal that a tool call failed. One is to raise a Python exception. The other is to return a CallToolResult with isError=True and content describing the failure. They look interchangeable in five-line examples. They are not. Picking the wrong one is the difference between an LLM that fixes its own mistake on the next turn and an LLM that crashes the whole conversation.

This article walks through both paths, the python-sdk inconsistency that lets exceptions leak through as plain text on some transports, and a concrete pattern that keeps the model in a state where it can retry without losing context.

The two error paths

A tool registered with a decorator such as FastMCP's @app.tool() runs inside a request handler. When the handler returns normally, the SDK packages the return value into the content field of a CallToolResult and ships it back to the client. When the handler raises, the SDK catches the exception, formats it into an error response, and ships that back instead.

So far so symmetric. The asymmetry shows up in isError.

A CallToolResult has a boolean isError field. When you return content explicitly with isError=True, the client sees a structured tool result that says "the tool ran, here is its output, and the output represents a failure." When the SDK catches your exception, it sets isError=True itself, but the content shape depends on which transport you are using and which version of python-sdk shipped that day.

This is the part the documentation does not lean into. The MCP spec at https://spec.modelcontextprotocol.io/specification/server/tools/ defines isError cleanly: a tool result MAY be flagged as an error so the model knows the call did not produce the intended outcome. The spec leaves serialization of caught exceptions to the SDK. python-sdk's implementation at https://github.com/modelcontextprotocol/python-sdk has gone through three different exception-to-content mappings over the past year. Older versions stuffed the stringified exception into a TextContent block with no structured fields. Newer versions wrap it in a structured error envelope. Some transports lose the wrapper on the way to the client.

The practical consequence: if you raise ValueError("bad input: temperature must be > 0") and rely on the LLM to read that string and retry, you are betting on a specific SDK version and transport. The bet usually pays. It does not always pay.

What the LLM actually sees

Claude, GPT-4, and other tool-using models read isError plus the content payload before deciding what to do next. The decision loop looks roughly like this:

  1. Model picks a tool and arguments.
  2. Tool runs. Result comes back.
  3. If isError=True, the model treats the content as feedback about why the call failed.
  4. Model either retries with corrected arguments, tries a different tool, or escalates to the user.

This loop only works if step 3 has useful content. A bare Exception: KeyError tells the model almost nothing about what to do next. A structured error like {"error_code": "missing_field", "field": "city", "hint": "pass a string like 'Hanoi' or 'San Francisco'"} lets the model write a corrected call on the next turn without asking the user for help.
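
To make the contrast concrete, here is a minimal sketch that builds both kinds of feedback for the same missing-argument failure. The helper name structured_error and the exact field names are illustrative assumptions, not part of the SDK.

```python
import json

def structured_error(code: str, message: str, **fields: str) -> str:
    """Serialize an error the model can act on (field names are illustrative)."""
    payload = {"error_code": code, "message": message, **fields}
    return json.dumps(payload, sort_keys=True)

# What a bare exception stringifies to: little for the model to act on.
try:
    {}["city"]
except KeyError as exc:
    bare = f"{type(exc).__name__}: {exc}"

# The same failure, described in terms the model can use on its next turn.
rich = structured_error(
    "missing_field",
    "required argument 'city' was not provided",
    field="city",
    hint="pass a string like 'Hanoi' or 'San Francisco'",
)

print(bare)
print(rich)
```

The bare string names an exception class and a key. The structured one names the field, the problem, and the next action.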

The numbers here matter. In a quick audit of a small MCP tool registry, tools that returned structured isError=True responses had roughly a 73% one-shot retry success rate. Tools that raised raw exceptions hovered around 41%. The remaining 59% of those raw-exception calls either looped (model retried the same bad call) or escalated with "I am not sure what went wrong." Same model, same tools, different error shape. The improvement is about 1.8× from a payload change alone.
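
The arithmetic behind those figures is short enough to check by hand. This snippet uses only the two rates quoted above; the variable names are mine.

```python
structured_success = 0.73  # one-shot retry success with structured isError=True responses
raw_success = 0.41         # one-shot retry success with raw exceptions

raw_failure = 1 - raw_success                    # share that looped or escalated
improvement = structured_success / raw_success   # relative gain from the payload change

print(f"raw-exception failure share: {raw_failure:.0%}")  # 59%
print(f"improvement: {improvement:.2f}x")                 # 1.78x
```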

When raising IS correct

Exceptions are not always wrong. There is a clean rule:

Raise when the tool cannot continue at all. Return isError=True when the tool ran far enough to know what went wrong and the model could plausibly fix it.

Cases where raising is correct:

  • A required environment variable is missing. The tool cannot run. The model cannot fix this. Crashing the call so the runtime logs a real Python traceback is more useful than handing the model a vague "your call failed" hint.
  • A downstream service times out and the timeout is the kind of thing that needs human attention (the service is genuinely down).
  • A truly unexpected condition: a None where a value should exist, an assertion that should never trip.

Cases where returning isError=True is correct:

  • Arguments are malformed in a way the model can fix.
  • A search returns zero results and the model should try a different query.
  • A precondition failed in a way the model can describe to the user ("this calendar event already exists, did you mean to update it?").
  • A rate limit was hit and the tool wants to tell the model to back off.

Rule of thumb: ask whether a future LLM turn could plausibly do something useful with the error. If yes, return isError=True with structured content. If no, raise.
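
One way to keep that rule from drifting is to write it down as a lookup. The sketch below encodes the two lists above; the Failure categories and the error_path helper are hypothetical names, and a real server may need finer categories.

```python
from enum import Enum

class Failure(Enum):
    MISSING_ENV_VAR = "missing_env_var"
    SERVICE_DOWN = "service_down"
    BROKEN_INVARIANT = "broken_invariant"
    BAD_ARGUMENTS = "bad_arguments"
    NO_RESULTS = "no_results"
    PRECONDITION_FAILED = "precondition_failed"
    RATE_LIMITED = "rate_limited"

# Failures a later LLM turn could plausibly do something useful with.
MODEL_CAN_ACT = frozenset({
    Failure.BAD_ARGUMENTS,
    Failure.NO_RESULTS,
    Failure.PRECONDITION_FAILED,
    Failure.RATE_LIMITED,
})

def error_path(failure: Failure) -> str:
    """Return 'return_is_error' if the model can act on the failure, else 'raise'."""
    return "return_is_error" if failure in MODEL_CAN_ACT else "raise"

print(error_path(Failure.RATE_LIMITED))     # return_is_error
print(error_path(Failure.MISSING_ENV_VAR))  # raise
```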

The python-sdk inconsistency in practice

Here is the concrete trap. Suppose you write a tool that raises on bad input because that felt Pythonic:

from mcp.server.fastmcp import FastMCP

app = FastMCP("weather")

@app.tool()
def get_weather(city: str, units: str = "metric") -> dict:
    if units not in ("metric", "imperial"):
        raise ValueError(f"units must be 'metric' or 'imperial', got {units!r}")
    # fetch_weather is assumed to be your own helper, defined elsewhere.
    return fetch_weather(city, units)

When the model calls get_weather(city="Hanoi", units="celsius"), the SDK catches the ValueError and ships an error response back. The model sees something like:

{
  "isError": true,
  "content": [{"type": "text", "text": "units must be 'metric' or 'imperial', got 'celsius'"}]
}

The model reads the string, infers the correction, retries with units="metric". This works.

The problem is what happens when the same tool runs under a different transport. Stdio in some SDK builds passes the raw exception type and message. HTTP in other builds wraps it in a JSON-RPC error envelope that the client unwraps differently. The Streamable HTTP transport, introduced in the 2025-03-26 revision of the spec, wraps exceptions in a third shape again. Across these, content[0].text is sometimes the bare message, sometimes the exception class plus message, and sometimes a stack trace prefix.

You will not catch this in unit tests that speak directly to the python-sdk's internal handler. You will catch it when a production Claude session starts producing "I got an unexpected error and cannot proceed" because the model now sees the message behind a wrapper prefix, something like mcp.exceptions.ToolException: ValueError: units must be... (the exact class name and prefix vary by build), and pattern-matches that prefix as fatal.
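
A cheap guard is a contract check on the error text as the client actually receives it, run once per transport. The classifier below is a heuristic sketch, not an SDK feature. The name error_text_shape and the three labels are assumptions, and the "prefixed" rule only detects a single-token preamble followed by a colon.

```python
import json

def error_text_shape(text: str) -> str:
    """Classify the text of an error result as 'structured', 'prefixed', or 'bare'."""
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None
    if isinstance(payload, dict) and "error_code" in payload:
        return "structured"
    # Heuristic: a "SomeName: message" preamble suggests a wrapper was added in transit.
    head, sep, _ = text.partition(": ")
    if sep and " " not in head:
        return "prefixed"
    return "bare"

print(error_text_shape("units must be 'metric' or 'imperial', got 'celsius'"))  # bare
print(error_text_shape("ValueError: units must be 'metric' or 'imperial'"))     # prefixed
print(error_text_shape('{"error_code": "invalid_argument", "message": "x"}'))   # structured
```

Feeding it the text captured from an end-to-end call over each transport shows whether two transports disagree about the same failure.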

The pattern that works

The fix is to swallow the exception inside the tool and emit a structured error result yourself. This way the wire shape is yours to control, not the SDK's:

import json
from mcp.server.fastmcp import FastMCP
from mcp.types import TextContent, CallToolResult

app = FastMCP("weather")

def _tool_error(code: str, message: str, hint: str | None = None) -> CallToolResult:
    payload = {"error_code": code, "message": message}
    if hint:
        payload["hint"] = hint
    return CallToolResult(
        content=[TextContent(type="text", text=json.dumps(payload))],
        isError=True,
    )

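# fetch_weather and WeatherServiceTimeout are assumed to be defined elsewhere.
# Check that your pinned SDK version passes a returned CallToolResult through
# unchanged rather than serializing it into the content of a success result.
@app.tool()
def get_weather(city: str, units: str = "metric") -> CallToolResult | dict: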
    if units not in ("metric", "imperial"):
        return _tool_error(
            code="invalid_argument",
            message=f"units must be 'metric' or 'imperial', got {units!r}",
            hint="retry with units='metric' or units='imperial'",
        )
    try:
        return fetch_weather(city, units)
    except WeatherServiceTimeout as e:
        return _tool_error(
            code="upstream_timeout",
            message=str(e),
            hint="the weather service is slow, try again in a few seconds",
        )

Three properties this gives you:

  1. The wire payload is deterministic across SDK versions and transports. Whatever the python-sdk does with exceptions, your structured errors travel as content you wrote.
  2. The model sees a JSON blob with error_code and hint fields. Models pick up the hint with much higher reliability when it is shaped as imperative ("retry with...") rather than declarative ("the value was wrong").
  3. You still raise for truly fatal cases (missing env var, programmer error, None where it should not be). Those become visible Python tracebacks in your server logs instead of being swallowed into the tool response.

Comparing the two approaches

Concern                         | Raise exception                    | Return isError=True
Wire format stability           | Depends on SDK version + transport | Controlled by you
Model self-correction rate      | Lower, varies                      | Higher, consistent
Server-side observability       | Easy, exceptions land in logs      | Requires explicit logging
Code ergonomics                 | Pythonic, terse                    | Slightly more verbose
Handles fatal errors            | Yes, naturally                     | Not its job
Handles user-recoverable errors | Poorly                             | Designed for this

The pattern above splits the work cleanly. Returns for things the model can fix. Raises for things the model cannot.

A note on logging

Returning isError=True does not mean your server forgets the call happened. Wire up a tool-call middleware that records every isError=True response with its error_code and arguments. Over a week of operation you get a histogram of which tools fail with which codes. That tells you which tool argument schemas need tightening and which models are bad at calling which tools. Without that data, all you know is that "something failed somewhere," which is also all the LLM knows.
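
The histogram itself is a few lines once the records exist. The sketch below assumes each record is a dict with tool and error_code keys; the sample records are made up purely to show the mechanics.

```python
from collections import Counter

# Hypothetical records, in the shape a tool-call middleware might collect.
records = [
    {"tool": "get_weather", "error_code": "invalid_argument"},
    {"tool": "get_weather", "error_code": "upstream_timeout"},
    {"tool": "get_weather", "error_code": "invalid_argument"},
    {"tool": "search_docs", "error_code": "no_results"},
]

# Count failures per (tool, error_code) pair.
histogram = Counter((r["tool"], r["error_code"]) for r in records)

for (tool, code), count in histogram.most_common():
    print(f"{tool:12} {code:18} {count}")
```

A pair that dominates the top of this list is usually a schema or description problem rather than a model problem.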

A short structured logger using stdlib logging is enough:

import json
import logging
import time

from mcp.types import CallToolResult

log = logging.getLogger("mcp.tools")

def log_tool_call(tool: str, args: dict, result: CallToolResult, started: float) -> None:
    duration_ms = int((time.monotonic() - started) * 1000)
    if result.isError:
        try:
            payload = json.loads(result.content[0].text)
            code = payload.get("error_code", "unknown")
        except (ValueError, IndexError, AttributeError):
            code = "unstructured"
        log.warning("tool_error tool=%s code=%s duration_ms=%d", tool, code, duration_ms)
    else:
        log.info("tool_ok tool=%s duration_ms=%d", tool, duration_ms)

Roll this into your tool decorator or a FastMCP middleware. The goal is machine-readable error codes flowing through one place, so you can ask questions like "which error code dominates this week" instead of grepping unstructured text.
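
A decorator version might look like the sketch below. It is an illustration, not an SDK hook: logged_tool and StubResult are hypothetical names, StubResult stands in for CallToolResult so the example runs without the SDK, and only argument names are logged so that values do not leak into logs.

```python
import functools
import logging
import time
from dataclasses import dataclass, field

log = logging.getLogger("mcp.tools")

@dataclass
class StubResult:
    """Stand-in exposing the one attribute this sketch reads (isError)."""
    isError: bool = False
    content: list = field(default_factory=list)

def logged_tool(func):
    """Time a tool call and log its outcome in one place."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        started = time.monotonic()
        result = func(**kwargs)
        duration_ms = int((time.monotonic() - started) * 1000)
        if getattr(result, "isError", False):
            log.warning("tool_error tool=%s args=%s duration_ms=%d",
                        func.__name__, sorted(kwargs), duration_ms)
        else:
            log.info("tool_ok tool=%s duration_ms=%d", func.__name__, duration_ms)
        return result
    return wrapper

@logged_tool
def lookup(city: str) -> StubResult:
    # An empty city is treated as a recoverable error for illustration.
    return StubResult(isError=(city == ""))

lookup(city="")       # logged at WARNING as tool_error
lookup(city="Hanoi")  # logged at INFO as tool_ok
```

Stack it under the tool-registration decorator so every registered tool passes through the same logging path.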

Decision checklist

When you write a new MCP tool, run through these four questions:

  1. Can the model plausibly fix this and retry on the next turn? If yes, return isError=True with a structured payload that includes a hint.
  2. Is the failure a programmer mistake or missing infrastructure? If yes, raise. Let it show up as a real traceback.
  3. Are you depending on a specific SDK exception-to-content mapping? If yes, switch to explicit returns. The mapping has changed three times and will change again.
  4. Are you logging error_code alongside the call? If no, you cannot tell whether your tools are getting better or worse over time.

Models choosing between tools, retrying, and escalating to the user is the loop that makes MCP useful. The error format you ship is the loop's input. Treat it with the same care you treat your tool's success payload, and the model will do most of the recovery work for you.
