When a subagent fails in a multi-agent system, the coordinator's recovery quality depends on how rich the error context is. A bare "subagent failed" gives the coordinator nothing. Structured error context with categories, partial results, and alternative approaches enables intelligent recovery.
The Rich Error Pattern
return {
  isError: true,
  errorCategory: "transient", // transient | validation | permission | business
  isRetryable: true,
  failureType: "timeout",
  attemptedQuery: "…",
  partialResults: [/* what we got before failing */],
  alternativeApproaches: [
    "try a broader query",
    "use document analysis instead of web search"
  ],
  description: "Web search timed out after 30s. Partial results from cached index included."
};
Why Each Field Matters
errorCategory. The coordinator decides recovery based on category. Transient → retry. Validation → fix input. Permission → escalate. Business → don't retry; alternate path.
isRetryable. Explicit retry guidance. Removes ambiguity from category interpretation.
partialResults. Even a partially-completed search is more useful than nothing. The coordinator can use what was returned.
alternativeApproaches. Nudges the coordinator toward valid recovery paths. Without this, the coordinator may not consider all options.
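One way to keep subagents and the coordinator in agreement on these fields is to write them down as a type. The following is a minimal TypeScript sketch, not a definitive schema: the field names come from the example above, while the type names (SubagentError, SubagentSuccess, SubagentResult) and the timeoutError helper are hypothetical.

```typescript
// Categories taken from the comment in the example above.
type ErrorCategory = "transient" | "validation" | "permission" | "business";

// Hypothetical type name; the fields mirror the rich error object.
interface SubagentError<T> {
  isError: true;
  errorCategory: ErrorCategory;
  isRetryable: boolean;
  failureType: string;             // e.g. "timeout"
  attemptedQuery: string;
  partialResults: T[];             // whatever was collected before the failure
  alternativeApproaches: string[];
  description: string;
}

// Hypothetical success shape, so that isError discriminates the union.
interface SubagentSuccess<T> {
  isError: false;
  results: T[];
}

type SubagentResult<T> = SubagentSuccess<T> | SubagentError<T>;

// Illustrative helper (an assumption, not part of the pattern itself):
// builds a timeout error that carries the partial results forward.
function timeoutError<T>(attemptedQuery: string, partialResults: T[]): SubagentError<T> {
  return {
    isError: true,
    errorCategory: "transient",
    isRetryable: true,
    failureType: "timeout",
    attemptedQuery,
    partialResults,
    alternativeApproaches: ["try a broader query"],
    description: `Search timed out; ${partialResults.length} partial result(s) included.`,
  };
}
```

Because isError is a literal true or false, a coordinator can check it once and the compiler knows which fields are available afterwards.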
Anti-Pattern 1: Silent Empty Result Suppression
If a subagent fails and returns an empty result without flagging the error, the coordinator believes the subtask succeeded with no findings. It synthesizes the response as if nothing was wrong. The user gets an answer that omits critical information without any indication that something failed.
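The difference is easiest to see side by side. This is a small TypeScript sketch under stated assumptions: flakySearch is a hypothetical backend that always times out, and the function and type names are made up for illustration.

```typescript
type SearchOutcome =
  | { isError: false; results: string[] }
  | { isError: true; errorCategory: "transient"; description: string; partialResults: string[] };

// Hypothetical search backend that always times out, for illustration only.
function flakySearch(query: string): string[] {
  throw new Error(`timeout while searching for "${query}"`);
}

// Anti-pattern: the failure is swallowed and looks like "no findings".
function searchSuppressed(query: string): string[] {
  try {
    return flakySearch(query);
  } catch {
    return [];
  }
}

// Preferred: the failure is flagged, so it cannot be mistaken for an empty result.
function searchFlagged(query: string): SearchOutcome {
  try {
    return { isError: false, results: flakySearch(query) };
  } catch (err) {
    return {
      isError: true,
      errorCategory: "transient",
      description: err instanceof Error ? err.message : String(err),
      partialResults: [],
    };
  }
}

// What the coordinator can say about each outcome.
function summarize(outcome: SearchOutcome): string {
  if (outcome.isError) return `search failed: ${outcome.description}`;
  return outcome.results.length === 0 ? "no findings" : `${outcome.results.length} finding(s)`;
}
```

With searchSuppressed the coordinator only ever sees an empty array. With searchFlagged it can distinguish "no findings" from "search failed" and tell the user which one happened.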
Anti-Pattern 2: Whole-Workflow Termination
If one subagent fails and the entire workflow terminates, the user gets a hard error for what may have been a recoverable issue, and the work of the subagents that succeeded is wasted. The right behavior is to propagate the error to the coordinator with enough context for recovery and let the coordinator decide whether the workflow can continue.
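As a rough illustration, a coordinator loop can record each failure next to the successful results instead of aborting. The sketch below is TypeScript with hypothetical names (Subagent, WorkflowReport, runWorkflow), and it uses synchronous subagents only to stay short; real subagent calls would normally be asynchronous.

```typescript
interface Failure {
  task: string;
  errorCategory: "transient" | "validation" | "permission" | "business";
  description: string;
}

interface WorkflowReport {
  completed: Record<string, string[]>; // results keyed by task name
  failed: Failure[];                   // structured errors for the coordinator to act on
}

// Hypothetical subagent signature: returns findings or throws.
type Subagent = () => string[];

// Runs every subagent; a failure is recorded rather than ending the loop.
function runWorkflow(subagents: Record<string, Subagent>): WorkflowReport {
  const report: WorkflowReport = { completed: {}, failed: [] };
  for (const task of Object.keys(subagents)) {
    try {
      report.completed[task] = subagents[task]();
    } catch (err) {
      report.failed.push({
        task,
        // Assumption: a real subagent would report its own category.
        errorCategory: "transient",
        description: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return report;
}

// Usage: one subagent fails, the other still contributes its findings.
const workflowReport = runWorkflow({
  webSearch: () => { throw new Error("timed out after 30s"); },
  documentAnalysis: () => ["finding A", "finding B"],
});
```

The coordinator then inspects workflowReport.failed and decides whether the completed results are enough to answer, or whether a failed task needs recovery first.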
Anti-Pattern 3: Generic "Search Unavailable"
The coordinator sees "search unavailable" and has no way to distinguish a 30-second timeout (retry) from a permanent service failure (route around) from a malformed query (fix input). Generic errors collapse the recovery decision space.
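To make the contrast concrete, here is a hedged TypeScript sketch in which the subagent reports which of the three causes occurred, so the coordinator can pick a different next step for each. The type names, the failureType strings, and the functions toSpecificError and nextStep are all hypothetical.

```typescript
// The three failure causes named above; the type and field names are assumptions.
type SearchFailure =
  | { failureType: "timeout"; afterSeconds: number }
  | { failureType: "service_down" }
  | { failureType: "malformed_query"; reason: string };

type NextStep = "retry" | "route_around" | "fix_input";

interface SpecificError {
  isError: true;
  failureType: SearchFailure["failureType"];
  isRetryable: boolean;
  description: string;
}

// Subagent side: keep the cause instead of collapsing it to "search unavailable".
function toSpecificError(failure: SearchFailure): SpecificError {
  switch (failure.failureType) {
    case "timeout":
      return {
        isError: true, failureType: "timeout", isRetryable: true,
        description: `Search timed out after ${failure.afterSeconds}s.`,
      };
    case "service_down":
      return {
        isError: true, failureType: "service_down", isRetryable: false,
        description: "Search service is permanently unavailable.",
      };
    case "malformed_query":
      return {
        isError: true, failureType: "malformed_query", isRetryable: false,
        description: `Query rejected: ${failure.reason}`,
      };
  }
}

// Coordinator side: each cause maps to a different recovery decision.
function nextStep(error: SpecificError): NextStep {
  if (error.isRetryable) return "retry";
  return error.failureType === "malformed_query" ? "fix_input" : "route_around";
}
```

A generic "search unavailable" string would force all three cases through the same branch; with the cause preserved, the decision is a simple lookup.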
Coordinator Recovery Patterns
Given a structured error, the coordinator can:
- Retry: for errorCategory: transient with isRetryable: true, with backoff.
- Fix and retry: for validation errors, fix the input based on the failure description.
- Try alternative approach: pick from alternativeApproaches.
- Use partial results: incorporate partialResults into synthesis even if full results are unavailable.
- Escalate: for permission errors that the coordinator can't resolve.
Subagent Local vs Propagated Failures
Subagents handle transient failures locally — retry once or twice with backoff before propagating. Only failures the subagent cannot resolve get propagated to the coordinator. This keeps the coordinator's error handling focused on real problems, not noise.
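A subagent-side retry wrapper might look like the following TypeScript sketch. The names withLocalRetry, Outcome, and PropagatedError are hypothetical, the default delays are arbitrary, and the sketch assumes every thrown error is transient; a real subagent would retry only failures it has classified as transient.

```typescript
interface PropagatedError {
  isError: true;
  errorCategory: "transient";
  isRetryable: boolean;
  description: string;
}

type Outcome<T> = { isError: false; value: T } | PropagatedError;

// Default delay helper; callers and tests can inject a faster one.
const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retries `operation` locally with exponential backoff and propagates
// a structured error only once the attempts are used up.
async function withLocalRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 2,          // "once or twice", per the text above
  baseDelayMs = 500,       // arbitrary starting delay (assumption)
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<Outcome<T>> {
  let lastMessage = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { isError: false, value: await operation() };
    } catch (err) {
      lastMessage = err instanceof Error ? err.message : String(err);
      // Wait 1x, 2x, 4x... the base delay, but not after the final attempt.
      if (attempt < maxRetries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  return {
    isError: true,
    errorCategory: "transient",
    isRetryable: true,
    description: `Failed after ${maxRetries + 1} attempts: ${lastMessage}`,
  };
}
```

The coordinator only sees the error if all local attempts fail, and the description records how many attempts were made so it can decide whether a further retry is worthwhile.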
Skills to Develop
- Return structured error context with category, retry flag, partial results, and alternative approaches.
- Avoid silently suppressing errors as empty results.
- Avoid whole-workflow termination on single subagent failure.
- Have subagents handle transient errors locally; propagate only unresolvable failures.