Tools fail. Networks time out, inputs are malformed, business rules block operations. How a tool communicates failure determines whether the agent recovers gracefully or goes off the rails. This lesson covers MCP's isError contract, error categories, and what subagents should propagate vs handle locally.
The isError Flag
MCP tools communicate failure via the isError boolean in the response. When isError: true, Claude knows the call did not succeed and can decide how to recover. A generic "Operation failed" message in this case is useless — Claude can't distinguish a transient timeout (retry) from a permission failure (escalate) from a validation error (fix the input).
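To make the contrast concrete, here is a minimal Python sketch of two failed tool results written as plain dictionaries. The `content` list plus `isError` layout follows the MCP tool result shape. The helper name `tool_error`, the `order_id` field, and the message wording are hypothetical and only for illustration.

```python
def tool_error(message: str) -> dict:
    """Build a failed tool result: one text content item plus isError=True."""
    return {
        "content": [{"type": "text", "text": message}],
        "isError": True,
    }


# Generic: the model cannot tell whether to retry, escalate, or fix the input.
vague = tool_error("Operation failed")

# Specific: names the kind of failure and the recovery path.
# (The field name and value are made-up example data.)
specific = tool_error(
    "Validation error: 'order_id' must be a numeric string, got 'A-17'. "
    "Fix the input and call the tool again."
)
```

Both results set the same flag. Only the second gives the model enough information to choose a recovery step.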
Error Categories
| Category | Example | Right action |
|---|---|---|
| Transient | Network timeout, rate limit | Retry with backoff |
| Validation | Bad input format, missing required field | Fix the input and retry |
| Business | Refund > policy limit, customer not verified | Escalate or follow alternate path; do NOT retry |
| Permission | Auth token expired, no scope for action | Escalate to human; do NOT retry blindly |
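The table can be read as a lookup from category to recovery action. The sketch below encodes it that way in Python. The action labels and the `next_action` helper are hypothetical names, and falling back to escalation for an unknown category is a design assumption, not something the lesson prescribes.

```python
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


# Hypothetical action labels, one per row of the table above.
ACTIONS = {
    ErrorCategory.TRANSIENT: "retry_with_backoff",
    ErrorCategory.VALIDATION: "fix_input_and_retry",
    ErrorCategory.BUSINESS: "escalate_or_alternate_path",
    ErrorCategory.PERMISSION: "escalate_to_human",
}


def next_action(error: dict) -> str:
    """Pick a recovery action from the error's category.

    A missing or unrecognized category falls back to escalation
    rather than a blind retry (an assumption of this sketch).
    """
    try:
        category = ErrorCategory(error.get("errorCategory"))
    except ValueError:
        return "escalate_to_human"
    return ACTIONS[category]
```

For example, `next_action({"errorCategory": "business"})` returns `"escalate_or_alternate_path"`, so a policy violation never reaches the retry path.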
The Structured Error Response Pattern
{
  "isError": true,
  "errorCategory": "transient",
  "isRetryable": true,
  "failureType": "timeout",
  "attemptedQuery": "…",
  "partialResults": [],
  "alternativeApproaches": ["try a broader query", "use document analysis"],
  "description": "Search timed out after 30s. Retry recommended."
}
The isRetryable flag explicitly encodes whether the model should try again. Including partialResults lets the model use what was returned even when the operation didn't complete fully. alternativeApproaches nudges the model toward valid recovery paths it might not have considered.
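One way to keep these payloads consistent is to build them in a single helper. The following is a minimal sketch, assuming a search tool that times out. The function name `timeout_error`, its parameters, and the sample query and partial result are all hypothetical.

```python
import json
from typing import Optional


def timeout_error(query: str, timeout_s: int, partial: Optional[list] = None) -> dict:
    """Build a structured error payload for a search that timed out."""
    return {
        "isError": True,
        "errorCategory": "transient",
        "isRetryable": True,
        "failureType": "timeout",
        "attemptedQuery": query,
        # Whatever arrived before the timeout is passed along, not discarded.
        "partialResults": list(partial or []),
        "alternativeApproaches": ["try a broader query", "use document analysis"],
        "description": f"Search timed out after {timeout_s}s. Retry recommended.",
    }


# Made-up example data for illustration.
payload = timeout_error("quarterly revenue", 30, partial=[{"title": "Q1 report"}])
print(json.dumps(payload, indent=2))
```

Centralizing construction means every timeout carries the same fields, so the model never has to guess whether a missing `isRetryable` means "no" or "unknown."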
Access Failure vs Empty Result
These two outcomes are easy to conflate. A search tool that times out is an access failure: the data may exist, but it couldn't be retrieved. A search tool that returns zero matches is a valid empty result: the operation succeeded, and the answer is "nothing matched." These need different responses:
- Access failure: isError: true, errorCategory: "transient", isRetryable: true.
- Empty result: isError: false, results: []. Not an error.
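The distinction can be sketched as a thin wrapper around a search backend. This is an illustrative Python example, not a prescribed implementation: `run_search`, `slow_backend`, and `empty_backend` are hypothetical names, and the backend is assumed to signal a timeout by raising the built-in `TimeoutError`.

```python
from typing import Callable


def run_search(backend: Callable[[str], list], query: str) -> dict:
    """Call a search backend and report timeouts separately from zero matches."""
    try:
        results = backend(query)
    except TimeoutError:
        # Access failure: the data may exist but could not be retrieved.
        return {
            "isError": True,
            "errorCategory": "transient",
            "isRetryable": True,
            "description": f"Search for {query!r} timed out.",
        }
    # Success, even when nothing matched: an empty list is a valid answer.
    return {"isError": False, "results": results}


def slow_backend(query: str) -> list:
    """Stand-in backend that always times out."""
    raise TimeoutError("no response")


def empty_backend(query: str) -> list:
    """Stand-in backend that succeeds with zero matches."""
    return []


failed = run_search(slow_backend, "invoices")
empty = run_search(empty_backend, "invoices")
```

Here `failed` carries `isError: True` and invites a retry, while `empty` is a successful result the model can report as "nothing matched."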
Subagent Error Propagation
Subagents should handle transient failures locally — if a search times out, retry once or twice with backoff. Only propagate errors to the coordinator that the subagent cannot resolve. When propagating, include partial results: even a half-completed analysis is more valuable to the coordinator than "subagent failed."
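A subagent's local handling might look like the sketch below: retry retryable failures with exponential backoff, stop immediately on anything else, and pass along whatever partial results exist. The function `run_with_local_retry`, its defaults, and the `attemptsMade` field are assumptions made for illustration. The `sleep` parameter is injectable so the example runs without real delays.

```python
import time
from typing import Callable


def run_with_local_retry(
    step: Callable[[], dict],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry retryable failures locally; return the first success or the last failure."""
    attempts = 0
    result: dict = {"isError": True, "isRetryable": False, "description": "step was never run"}
    while attempts < max_attempts:
        result = step()
        attempts += 1
        if not result.get("isError"):
            return result  # resolved locally; the coordinator never sees the failures
        if not result.get("isRetryable"):
            break  # retrying cannot help; propagate right away
        if attempts < max_attempts:
            sleep(base_delay * 2 ** (attempts - 1))  # 0.5s, 1s, 2s, ...
    # Propagate the unresolved failure along with any partial results.
    return {
        **result,
        "partialResults": result.get("partialResults", []),
        "attemptsMade": attempts,
    }


# Usage: a stand-in step that times out twice, then succeeds.
calls = {"n": 0}


def flaky_step() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        return {"isError": True, "isRetryable": True, "failureType": "timeout"}
    return {"isError": False, "results": ["doc-1"]}


outcome = run_with_local_retry(flaky_step, sleep=lambda seconds: None)
```

In this run the two timeouts are absorbed by the subagent and `outcome` is the successful result. A non-retryable failure, such as a permission error, would instead be returned after a single attempt.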
Skills to Develop
- Categorize tool errors into transient, validation, business, and permission types.
- Write error responses that include isRetryable, partial results, and alternative approaches.
- Distinguish access failure from valid empty results.
- Have subagents resolve transient errors locally; propagate only unresolvable failures.