Structured Error Responses

Estimated time: 25 minutes

Tools fail. Networks time out, inputs are malformed, business rules block operations. How a tool communicates failure determines whether the agent recovers gracefully or goes off the rails. This lesson covers MCP's isError contract, error categories, and what subagents should propagate vs handle locally.

The isError Flag

MCP tools signal failure through the isError boolean in the tool result. When isError is true, Claude knows the call did not succeed and can decide how to recover. A generic "Operation failed" message is useless here: Claude cannot tell a transient timeout (retry) from a permission failure (escalate) from a validation error (fix the input).

Error Categories

Category   | Example                                      | Right action
Transient  | Network timeout, rate limit                  | Retry with backoff
Validation | Bad input format, missing required field     | Fix the input and retry
Business   | Refund > policy limit, customer not verified | Escalate or follow alternate path; do NOT retry
Permission | Auth token expired, no scope for action      | Escalate to human; do NOT retry blindly
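
The category table can be read as a lookup from category to recovery action. The sketch below is one way to express that in Python; the action labels and the next_action helper are hypothetical names chosen for illustration, and the fallback for unknown categories is an assumption rather than something the lesson prescribes.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


# Hypothetical action labels; only the category-to-action pairing comes from the table.
RECOVERY_ACTION = {
    ErrorCategory.TRANSIENT: "retry_with_backoff",
    ErrorCategory.VALIDATION: "fix_input_and_retry",
    ErrorCategory.BUSINESS: "escalate_or_alternate_path",
    ErrorCategory.PERMISSION: "escalate_to_human",
}


def next_action(error: dict) -> str:
    """Pick a recovery action from a structured error's errorCategory field."""
    try:
        category = ErrorCategory(error.get("errorCategory"))
    except ValueError:
        # Assumption: an unknown or missing category is escalated, never retried.
        return "escalate_to_human"
    return RECOVERY_ACTION[category]
```

Keeping the mapping in one table makes the "do NOT retry" rows hard to violate by accident: a business or permission error can never resolve to a retry action.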

The Structured Error Response Pattern

{
  "isError": true,
  "errorCategory": "transient",
  "isRetryable": true,
  "failureType": "timeout",
  "attemptedQuery": "…",
  "partialResults": [],
  "alternativeApproaches": ["try a broader query", "use document analysis"],
  "description": "Search timed out after 30s. Retry recommended."
}

The isRetryable flag explicitly encodes whether the model should try again. Including partialResults lets the model use what was returned even when the operation didn't complete fully. alternativeApproaches nudges the model toward valid recovery paths it might not have considered.
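
A tool might assemble this payload with a small helper so that every error carries the same fields. The following is a minimal sketch, not a prescribed implementation: build_error and to_tool_result are hypothetical helpers, the query string is a placeholder, and the wrapper assumes the tool result is a dict with a "content" list of text items plus a top-level "isError" flag.

```python
import json


def build_error(category, failure_type, description, *, retryable,
                attempted_query=None, partial_results=None, alternatives=None):
    """Assemble a structured error payload using the field names shown above."""
    return {
        "isError": True,
        "errorCategory": category,
        "isRetryable": retryable,
        "failureType": failure_type,
        "attemptedQuery": attempted_query,
        "partialResults": partial_results or [],
        "alternativeApproaches": alternatives or [],
        "description": description,
    }


def to_tool_result(payload):
    """Wrap the payload as JSON text content and mirror its isError flag.

    Assumption: results are returned as a dict with a "content" list of
    text items and a top-level "isError" boolean.
    """
    return {
        "content": [{"type": "text", "text": json.dumps(payload)}],
        "isError": payload["isError"],
    }


# Usage with a placeholder query (not taken from the lesson).
timeout_error = build_error(
    "transient", "timeout", "Search timed out after 30s. Retry recommended.",
    retryable=True,
    attempted_query="example query",
    alternatives=["try a broader query", "use document analysis"],
)
result = to_tool_result(timeout_error)
```

Defaulting partialResults and alternativeApproaches to empty lists means consumers can always iterate over them without checking for missing keys.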

Access Failure vs Empty Result

These two cases are easy to confuse. A search tool that times out is an access failure: the data may exist but could not be retrieved. A search tool that returns zero matches is a valid empty result: the operation succeeded, and the answer is "nothing matched." They need different responses:

  • Access failure: isError: true, errorCategory: "transient", isRetryable: true.
  • Empty result: isError: false, results: []. Not an error.
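
The two bullets above can be sketched as a thin wrapper around a search call. Everything here is illustrative: run_search, the backend callable, and the two stub backends are hypothetical, and the sketch assumes the backend raises Python's built-in TimeoutError when it cannot reach its data source.

```python
def run_search(query, backend):
    """Call a search backend and report timeouts and empty results differently.

    `backend` is a hypothetical callable that returns a list of matches and
    raises TimeoutError when the data source cannot be reached in time.
    """
    try:
        matches = backend(query)
    except TimeoutError:
        # Access failure: the data may exist, but it could not be retrieved.
        return {
            "isError": True,
            "errorCategory": "transient",
            "isRetryable": True,
            "failureType": "timeout",
            "attemptedQuery": query,
            "description": "Search timed out. Retry recommended.",
        }
    # Success, even when matches is empty: "nothing matched" is a valid answer.
    return {"isError": False, "results": list(matches)}


def empty_backend(query):
    """Stub backend that finds nothing."""
    return []


def slow_backend(query):
    """Stub backend that simulates a timeout."""
    raise TimeoutError("simulated timeout")
```

The key design point is that an empty list never passes through the error branch, so the model is not prompted to retry a search that already gave its answer.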

Subagent Error Propagation

Subagents should handle transient failures locally: if a search times out, retry once or twice with backoff. Propagate to the coordinator only the errors the subagent cannot resolve. When propagating, include partial results: even a half-completed analysis is more valuable to the coordinator than "subagent failed."
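
One way a subagent could apply this policy is a retry loop that stops as soon as an error is not retryable, then attaches partial results before reporting upward. This is a sketch under stated assumptions: run_with_retry and subagent_report are hypothetical helpers, the operation is assumed to return a dict in the structured format from this lesson, and the attempt count and delays are arbitrary defaults.

```python
import time


def run_with_retry(operation, *, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `operation` while it returns a retryable error.

    `operation` is a hypothetical zero-argument callable returning a dict in
    the structured format. Returns the first success or the last error seen.
    """
    result = None
    for attempt in range(max_attempts):
        result = operation()
        if not result.get("isError"):
            return result
        if not result.get("isRetryable"):
            break  # Validation, business, or permission errors: retrying will not help.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # Exponential backoff: 0.5s, then 1s.
    return result


def subagent_report(operation, partial_results, **retry_options):
    """Resolve transient failures locally; propagate the rest with partial results."""
    result = run_with_retry(operation, **retry_options)
    if result.get("isError"):
        # Attach completed work so the coordinator can still use it.
        result = {**result, "partialResults": partial_results}
    return result
```

With the defaults, the operation runs at most three times (one initial call plus two retries), which matches "retry once or twice." The sleep parameter is injectable so the loop can be exercised without real delays.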

Skills to Develop

  1. Categorize tool errors into transient, validation, business, and permission types.
  2. Write error responses that include isRetryable, partial results, and alternative approaches.
  3. Distinguish access failure from valid empty results.
  4. Have subagents resolve transient errors locally; propagate only unresolvable failures.

Exam tip (Q8): A web search subagent times out. The right response is structured error context with retry guidance and alternative approaches, not a generic "search unavailable", not a silent empty result, and definitely not termination of the entire workflow.