Error Propagation in Multi-Agent Systems

Estimated time: 25 minutes

When a subagent fails in a multi-agent system, the coordinator's recovery quality depends on how rich the error context is. A bare "subagent failed" gives the coordinator nothing. Structured error context with categories, partial results, and alternative approaches enables intelligent recovery.

The Rich Error Pattern

// Sketch: returned from inside a subagent's search handler.
// The wrapper function name is illustrative only.
function searchTimeoutError() {
  return {
    isError: true,
    errorCategory: "transient",      // transient | validation | permission | business
    isRetryable: true,
    failureType: "timeout",
    attemptedQuery: "…",
    partialResults: [/* what we got before failing */],
    alternativeApproaches: [
      "try a broader query",
      "use document analysis instead of web search"
    ],
    description: "Web search timed out after 30s. Partial results from cached index included."
  };
}

Why Each Field Matters

  • errorCategory. The coordinator chooses a recovery strategy by category. Transient → retry. Validation → fix the input. Permission → escalate. Business → do not retry; take an alternate path.
  • isRetryable. Explicit retry guidance, so the coordinator does not have to infer retryability from the category.
  • partialResults. Even a partially completed search is more useful than nothing. The coordinator can use whatever was returned.
  • alternativeApproaches. Nudges the coordinator toward valid recovery paths it might not otherwise consider.
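
One way to keep these fields consistent is to build every error through a single helper. The following is a minimal sketch, not a prescribed implementation: `buildSubagentError` and `ERROR_CATEGORIES` are hypothetical names, and defaulting `isRetryable` to true only for transient errors is an assumption made for this example.

```javascript
// Hypothetical helper; names and defaults are assumptions for illustration.
const ERROR_CATEGORIES = ["transient", "validation", "permission", "business"];

function buildSubagentError({
  errorCategory,
  failureType,
  description,
  attemptedQuery,
  isRetryable,
  partialResults = [],
  alternativeApproaches = [],
}) {
  // Reject categories the coordinator would not know how to handle.
  if (!ERROR_CATEGORIES.includes(errorCategory)) {
    throw new Error(`Unknown errorCategory: ${errorCategory}`);
  }
  return {
    isError: true,
    errorCategory,
    // An explicit flag wins; otherwise assume only transient errors are retryable.
    isRetryable: isRetryable ?? (errorCategory === "transient"),
    failureType,
    attemptedQuery,
    partialResults,
    alternativeApproaches,
    description,
  };
}

const timeoutError = buildSubagentError({
  errorCategory: "transient",
  failureType: "timeout",
  attemptedQuery: "example query",
  partialResults: [{ title: "cached hit" }],
  alternativeApproaches: ["try a broader query"],
  description: "Web search timed out after 30s.",
});
console.log(timeoutError.isRetryable); // true
```

Validating the category at construction time means a typo such as "transiant" fails loudly in the subagent rather than silently confusing the coordinator.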

Anti-Pattern 1: Silent Empty Result Suppression

If a subagent fails and returns an empty result without flagging the error, the coordinator believes the subtask succeeded with no findings. It synthesizes the response as if nothing was wrong. The user gets an answer that omits critical information without any indication that something failed.
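
The difference between "no findings" and "failed" can be made explicit in the return value. The sketch below assumes a hypothetical wrapper, `runSearchSubagent`, around an arbitrary `searchFn`; for brevity it labels every thrown error as transient, which a real subagent would refine.

```javascript
// Hypothetical subagent wrapper; `searchFn` stands in for any async search call.
async function runSearchSubagent(searchFn, query) {
  try {
    const results = await searchFn(query);
    // A genuine "no findings" outcome is a success with an empty list.
    return { isError: false, results };
  } catch (err) {
    // Anti-pattern: `return { isError: false, results: [] }` here would hide the failure.
    return {
      isError: true,
      errorCategory: "transient", // assumption: this sketch does not classify failures
      isRetryable: true,
      failureType: "exception",
      attemptedQuery: query,
      partialResults: [],
      alternativeApproaches: [],
      description: `Search failed: ${err.message}`,
    };
  }
}
```

With this shape, the coordinator can tell `{ isError: false, results: [] }` (searched, found nothing) apart from `{ isError: true, ... }` (did not complete) and can say so in its synthesis.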

Anti-Pattern 2: Whole-Workflow Termination

If one subagent fails and the entire workflow terminates, the user gets a hard error for what may have been a recoverable issue, and the work of the subagents that succeeded is discarded. The right behavior is to propagate the error to the coordinator with enough context for recovery and let the coordinator decide whether the workflow can continue.
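
One way to avoid terminating on the first failure is to wait for every subagent and sort the outcomes afterwards. The sketch below uses the standard `Promise.allSettled`; `runSubagents` and the shape of each task are assumptions for illustration.

```javascript
// Hypothetical coordinator step. Each task is a function returning a promise
// that resolves to a success object or a structured error object.
async function runSubagents(tasks) {
  // allSettled never rejects, so one failing subagent cannot abort the rest.
  const settled = await Promise.allSettled(tasks.map((task) => task()));
  const successes = [];
  const errors = [];
  for (const outcome of settled) {
    if (outcome.status === "rejected") {
      // An uncaught exception is wrapped so it is recorded rather than fatal.
      // A real subagent would supply fuller context than this.
      const reason = outcome.reason;
      errors.push({
        isError: true,
        description: reason?.message ?? String(reason),
      });
    } else if (outcome.value.isError) {
      errors.push(outcome.value);
    } else {
      successes.push(outcome.value);
    }
  }
  return { successes, errors };
}
```

The coordinator then decides, from `errors`, whether to retry, route around, or synthesize from `successes` with a note about what is missing.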

Anti-Pattern 3: Generic "Search Unavailable"

The coordinator sees "search unavailable" and cannot tell a 30-second timeout (retry) from a permanent service failure (route around) or a malformed query (fix input). Generic errors collapse the recovery decision space: every failure looks the same, so the coordinator cannot pick the right response.
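
A subagent can keep these cases distinct by translating low-level failures into specific descriptors before reporting them. In the sketch below, `describeSearchFailure` and the `code` values ("TIMEOUT", "SERVICE_DOWN", "BAD_QUERY") are hypothetical; real clients expose their own error codes.

```javascript
// Hypothetical mapping from a low-level error to a specific failure descriptor.
// The `code` values are illustrative assumptions, not a real client's API.
function describeSearchFailure(err) {
  switch (err.code) {
    case "TIMEOUT":
      return { failureType: "timeout", isRetryable: true, suggestedRecovery: "retry" };
    case "SERVICE_DOWN":
      return { failureType: "service_failure", isRetryable: false, suggestedRecovery: "route around" };
    case "BAD_QUERY":
      // Not retryable as-is; the input has to change first.
      return { failureType: "malformed_query", isRetryable: false, suggestedRecovery: "fix input" };
    default:
      // Unknown failures are reported as unknown rather than guessed at.
      return { failureType: "unknown", isRetryable: false, suggestedRecovery: "escalate" };
  }
}

console.log(describeSearchFailure({ code: "TIMEOUT" }).suggestedRecovery); // retry
```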

Coordinator Recovery Patterns

Given a structured error, the coordinator can:

  • Retry. For errorCategory: transient with isRetryable: true, retry with backoff.
  • Fix and retry. For validation errors, correct the input based on the failure description, then retry.
  • Try an alternative approach. Pick one from alternativeApproaches.
  • Use partial results. Incorporate partialResults into the synthesis even if full results are unavailable.
  • Escalate. For permission errors that the coordinator cannot resolve.
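
The options above can be sketched as a single dispatch function. `chooseRecovery` and its action names are hypothetical, and the order of the checks is one reasonable choice rather than a prescribed policy.

```javascript
// Hypothetical dispatcher from a structured error to a recovery action.
function chooseRecovery(error) {
  if (error.errorCategory === "transient" && error.isRetryable) {
    return { action: "retry" };
  }
  if (error.errorCategory === "validation") {
    return { action: "fix_and_retry", hint: error.description };
  }
  if (error.errorCategory === "permission") {
    return { action: "escalate" };
  }
  // Business errors and non-retryable failures fall through to these options.
  if (error.alternativeApproaches?.length > 0) {
    return { action: "try_alternative", approach: error.alternativeApproaches[0] };
  }
  if (error.partialResults?.length > 0) {
    return { action: "use_partial_results", results: error.partialResults };
  }
  // Nothing else applies, so hand the problem upward.
  return { action: "escalate" };
}

const decision = chooseRecovery({
  errorCategory: "business",
  isRetryable: false,
  alternativeApproaches: ["use document analysis instead of web search"],
});
console.log(decision.action); // try_alternative
```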

Subagent Local vs Propagated Failures

Subagents should handle transient failures locally: retry once or twice with backoff before propagating. Only failures the subagent cannot resolve are propagated to the coordinator. This keeps the coordinator's error handling focused on real problems, not noise.
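
A minimal sketch of local retry with exponential backoff follows. `withLocalRetry`, `sleep`, and the default delays are assumptions for illustration, and the sketch assumes only operations prone to transient failure are wrapped this way.

```javascript
// Hypothetical helper: retry locally, then propagate a structured error.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withLocalRetry(operation, { retries = 2, baseDelayMs = 200 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return { isError: false, result: await operation() };
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Local recovery is exhausted: hand structured context to the coordinator.
  return {
    isError: true,
    errorCategory: "transient", // assumption: only transient-prone calls are wrapped
    isRetryable: true,
    failureType: "retries_exhausted",
    partialResults: [],
    alternativeApproaches: [],
    description: `Failed after ${retries + 1} attempts: ${lastError.message}`,
  };
}
```

A call that fails once and then succeeds never reaches the coordinator as an error; only the exhausted case is propagated.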

Skills to Develop

  1. Return structured error context with category, retry flag, partial results, and alternative approaches.
  2. Avoid silently suppressing errors as empty results.
  3. Avoid whole-workflow termination on single subagent failure.
  4. Have subagents handle transient errors locally; propagate only unresolvable failures.

Exam tip (Q8): Web search subagent times out → return structured error context with retry guidance and alternatives. Not generic "search unavailable", not silent empty result, not workflow termination.