Code Intermediate medium · 6 min

Tool errors: handling failure gracefully

What you will learn

Catch and recover from tool execution failures instead of crashing your agent loop.

Why this matters

In production, external tools fail: APIs time out, services go down, invalid arguments slip through. Without proper error handling, your agent stops cold. Graceful failure lets agents retry, use fallback tools, or ask for clarification instead of returning a 500 error to your user.

Skip if: You don't need explicit tool error handling if you're building a simple one-shot chain (prompt → LLM → parse output, no tool calls). Also unnecessary if you're in a learning/sandbox environment where crashes are acceptable.

Explanation

Tool errors are exceptions raised while a tool executes: bad API responses, network timeouts, validation failures, or bad arguments. In LangChain, these errors bubble up and break your agentic loop unless something catches them. Graceful handling means intercepting the error, logging it, and then retrying the tool, providing a fallback response, or asking the LLM to adjust its next move. LangChain's ToolException is designed for this: you raise it inside a tool to signal an expected, recoverable failure. By default it still propagates to the caller, so either catch it yourself around the tool call (as the code below does) or set the tool's handle_tool_error option (True, a fixed string, or a callable) so the tool returns the error text as its output instead of raising. Once that text is passed back as a tool message, the LLM sees the error as data (not a crash) and can reason about what went wrong. This keeps the loop alive and lets the LLM self-correct: call a different tool, refine arguments, or report the failure to the user intentionally.
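To make "error as data" concrete, here is a minimal, dependency-free sketch of the catch-and-convert step. It is not LangChain's implementation: the `ToolException` class is a local stand-in for the one in `langchain_core.tools`, and `ToolResult`, `lookup_order`, and `run_tool_call` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


class ToolException(Exception):
    """Local stand-in for langchain_core.tools.ToolException (assumption)."""


@dataclass
class ToolResult:
    """Hypothetical record handed back to the model after a tool call."""
    tool_call_id: str
    content: str
    status: str  # "success" or "error"


def lookup_order(order_id: int) -> str:
    """Hypothetical tool: fails for non-positive IDs."""
    if order_id < 1:
        raise ToolException(f"Invalid order_id: {order_id}. Must be >= 1.")
    return f"Order {order_id}: shipped"


def run_tool_call(tool_call: dict) -> ToolResult:
    """Execute one tool call; turn a ToolException into an error result."""
    try:
        output = lookup_order(**tool_call["args"])
        return ToolResult(tool_call["id"], output, "success")
    except ToolException as exc:
        # The failure becomes data for the next model turn, not a crash.
        return ToolResult(tool_call["id"], f"Tool error: {exc}", "error")


ok = run_tool_call({"name": "lookup_order", "args": {"order_id": 7}, "id": "call_1"})
bad = run_tool_call({"name": "lookup_order", "args": {"order_id": -5}, "id": "call_2"})
print(ok)
print(bad)
```

In a real agent loop, the error result would be appended to the message history so the model can choose its next step.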

Analogy

Think of a waiter at a restaurant. If a customer orders a dish and the kitchen says 'we're out of that ingredient,' the waiter doesn't panic and walk out. Instead, they tell the customer the problem and ask 'Would you like the salmon instead?' The agent is like that waiter: when a tool fails, it doesn't crash; it receives the error message and decides the next best action.

Code

python

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool, ToolException  # ToolException lives in langchain_core.tools
from langchain_core.prompts import ChatPromptTemplate
import json

@tool
def fetch_user_data(user_id: int) -> dict:
    """Fetch user information by ID. Raises ToolException on failure."""
    if user_id < 1:
        raise ToolException(f"Invalid user_id: {user_id}. Must be >= 1.")
    if user_id == 999:
        raise ToolException("Service unavailable: user database is down.")
    return {"user_id": user_id, "name": f"User_{user_id}", "email": f"user{user_id}@example.com"}

@tool
def get_fallback_user() -> dict:
    """Return a default user when primary lookup fails."""
    return {"user_id": 0, "name": "Guest", "email": "guest@example.com"}

def handle_tool_error(error: ToolException) -> str:
    """Convert a tool error to a message the LLM can read and act on."""
    return f"Tool failed with error: {error} Consider using the get_fallback_user tool or asking the user for valid input."

tools = [fetch_user_data, get_fallback_user]
tools_by_name = {t.name: t for t in tools}
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_with_tools = llm.bind_tools(tools)

prompt = ChatPromptTemplate.from_messages([
    ("user", "Fetch user data for user_id {user_id}. If that fails, use get_fallback_user and return the result.")
])

chain = prompt | llm_with_tools

def run_test(user_id: int) -> None:
    """Ask the model for tool calls, run each one, and report any tool error."""
    try:
        result = chain.invoke({"user_id": user_id})
        for tool_call in result.tool_calls:
            print(f"Tool call: {tool_call['name']}({tool_call['args']})")
            tool_obj = tools_by_name[tool_call['name']]
            try:
                output = tool_obj.invoke(tool_call['args'])
                print(f"Tool output: {json.dumps(output, indent=2)}")
            except ToolException as e:
                # With the default handle_tool_error=False, the exception reaches us here.
                print(f"Tool error caught: {handle_tool_error(e)}")
    except Exception as e:
        # Anything else (network failure, missing API key) is a chain-level error.
        print(f"Chain error: {e}")

print("=== Test 1: Valid user_id ===")
run_test(42)

print("\n=== Test 2: Invalid user_id (negative) ===")
run_test(-5)

print("\n=== Test 3: Service failure (user_id=999) ===")
run_test(999)

Output

=== Test 1: Valid user_id ===
Tool call: fetch_user_data({'user_id': 42})
Tool output: {
  "user_id": 42,
  "name": "User_42",
  "email": "user42@example.com"
}

=== Test 2: Invalid user_id (negative) ===
Tool call: fetch_user_data({'user_id': -5})
Tool error caught: Tool failed with error: Invalid user_id: -5. Must be >= 1. Consider using the get_fallback_user tool or asking the user for valid input.

=== Test 3: Service failure (user_id=999) ===
Tool call: fetch_user_data({'user_id': 999})
Tool error caught: Tool failed with error: Service unavailable: user database is down. Consider using the get_fallback_user tool or asking the user for valid input.

What just happened?

The code defined two tools: `fetch_user_data`, which raises `ToolException` on invalid or unavailable data, and `get_fallback_user` as a recovery option. For each test case, we invoked the chain with a different user_id and executed the tool calls the model requested. When the tool raised a `ToolException`, we caught it with a try/except block and converted it to a human-readable message. Valid inputs returned data; invalid or service-failure inputs triggered the error handler. This example only prints that message; in a full agent loop you would send it back to the model as the tool's result so the agent can reason about recovery.

Common gotcha

Developers often raise generic Python exceptions (ValueError, RuntimeError) inside tools and expect the agent to handle them. These propagate out of the tool and crash the agent loop unless your executor catches them. Raise ToolException explicitly, or wrap other exceptions and re-raise them as ToolException, and make sure something handles it: a try/except around the tool call or the tool's handle_tool_error option. Only then is the error fed back to the LLM. Also remember that ToolException messages are strings the LLM reads, so they must be clear and actionable, not cryptic error codes.
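One way to wrap and re-raise is a small decorator, sketched below under stated assumptions: `ToolException` is a local stand-in for the class in `langchain_core.tools`, and `as_tool_exception` and `parse_quantity` are hypothetical names. The set of exception types to convert is a choice you would tune for your own tools.

```python
import functools


class ToolException(Exception):
    """Local stand-in for langchain_core.tools.ToolException (assumption)."""


def as_tool_exception(func):
    """Re-raise expected failures from `func` as ToolException."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ToolException:
            raise  # already in the right form
        except (ValueError, TimeoutError, ConnectionError) as exc:
            # Keep the original exception as the cause for debugging.
            raise ToolException(f"{func.__name__} failed: {exc}") from exc
    return wrapper


@as_tool_exception
def parse_quantity(text: str) -> int:
    """Hypothetical tool body that raises a plain ValueError on bad input."""
    return int(text)


try:
    parse_quantity("twelve")
except ToolException as exc:
    message = str(exc)
    print(message)
```

Unexpected exception types still propagate unchanged, which is usually what you want: a programming bug should surface loudly rather than be handed to the LLM as a recoverable error.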

Error recovery

AttributeError: 'dict' object has no attribute 'name'

You're accessing tool_call incorrectly. In modern LangChain (1.2.x), tool_calls from the model are dicts with keys 'name', 'args', 'id'. Access as tool_call['name'], not tool_call.name.

ToolException not caught, agent crashes

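Either the tool raised a regular Python exception instead of ToolException, or nothing is handling the ToolException. Change `raise ValueError(msg)` to `raise ToolException(msg)`, importing it with `from langchain_core.tools import ToolException`. Then catch ToolException around the tool call, or set the tool's handle_tool_error option so the error is returned as tool output.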

LLM sees the error but doesn't retry

The LLM may lack context or may not have access to fallback tools. Add explicit guidance in the prompt and ensure all recovery tools are bound to the model with bind_tools(). Also consider using agentic loops (LangGraph) for automatic retry logic.

Tool error message too long, causes token bloat

Truncate or summarize the error message. Instead of passing the full stack trace, raise a short, actionable message such as `raise ToolException('API timeout: retry with backoff')`. Keep errors under 100 characters when possible.
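A minimal sketch of such truncation, assuming a hypothetical helper named `short_error` and the 100-character budget suggested above:

```python
def short_error(exc: Exception, limit: int = 100) -> str:
    """Return a one-line error summary no longer than `limit` characters."""
    text = f"{type(exc).__name__}: {exc}"
    first_line = text.splitlines()[0]  # drop any multi-line trace detail
    if len(first_line) <= limit:
        return first_line
    return first_line[: limit - 3] + "..."


# Simulated noisy error: a useful first line followed by trace-like lines.
noisy = RuntimeError("API timeout after 30s\n" + "  at frame ...\n" * 50)
print(short_error(noisy))
```

The shortened string is what you would put in the ToolException message; the full detail belongs in your logs.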

Experienced dev note

In production, the difference between a handled ToolException and an unhandled exception is the difference between graceful degradation and a 500 error. But here's what most devs miss: even with ToolException, the LLM needs to know *how* to recover. If you only raise the error without giving it access to a fallback tool or retry logic, the LLM will just apologize to the user. Always pair error handling with alternative tools or explicit recovery instructions in the prompt. Also, log all ToolExceptions: not in the LLM message, but to your observability backend. Errors are data; they tell you which services are flaky.
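The split between "what the operator sees" and "what the LLM sees" might look like the sketch below. It uses Python's standard `logging` module as a placeholder for an observability backend; `ToolException` is a local stand-in for the class in `langchain_core.tools`, and `report_tool_failure` is a hypothetical helper.

```python
import logging

logger = logging.getLogger("tool_errors")


class ToolException(Exception):
    """Local stand-in for langchain_core.tools.ToolException (assumption)."""


def report_tool_failure(tool_name: str, exc: Exception) -> str:
    """Log full detail for operators; return a short message for the model."""
    # exc_info attaches the traceback to the log record only.
    logger.error("tool=%s failed: %s", tool_name, exc, exc_info=exc)
    return (
        f"{tool_name} is unavailable right now. "
        "Try get_fallback_user or ask the user how to proceed."
    )


logging.basicConfig(level=logging.ERROR)
try:
    raise ToolException("Service unavailable: user database is down.")
except ToolException as exc:
    reply = report_tool_failure("fetch_user_data", exc)
print(reply)
```

The model receives only the recovery-oriented reply, while the log record keeps the original message and traceback for diagnosing flaky services.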

Check your understanding

If a tool raises a ToolException with the message 'API rate limit exceeded', why does feeding that message back to the LLM not automatically solve the problem? What would the LLM need to actually recover?

Answer hint

A correct answer recognizes that the LLM reads the error as text data, not as a trigger for built-in retry logic. Recovery requires either (1) an explicit retry tool that the LLM can call, (2) a fallback tool, or (3) a human-in-the-loop prompt. The error message alone doesn't fix the rate limit; the LLM needs *options* to act on.
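As an illustration of giving the system an option beyond reading the error, here is a sketch of retry with exponential backoff wrapped around a tool body. It is one possible design, not a built-in LangChain feature: `call_with_retry` and `flaky_api` are hypothetical, `ToolException` is a local stand-in for the class in `langchain_core.tools`, and the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time


class ToolException(Exception):
    """Local stand-in for langchain_core.tools.ToolException (assumption)."""


def call_with_retry(func, *, attempts: int = 3, base_delay: float = 0.5, sleep=time.sleep):
    """Call `func`, retrying on ToolException with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ToolException:
            if attempt == attempts:
                raise  # out of retries; let the caller report the failure
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_api() -> str:
    """Hypothetical tool body: rate-limited on the first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ToolException("API rate limit exceeded")
    return "ok"


delays = []
result = call_with_retry(flaky_api, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

If every attempt fails, the final ToolException still propagates, at which point the fallback tool or a question to the user becomes the remaining option.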

VERSION: As of langchain-core 0.3.x and langchain 1.2.x, ToolException (imported from langchain_core.tools) is the standard way to signal a recoverable tool failure. If older code relies on a catch_tool_exceptions context manager or on string-based tool error injection, check it against the release notes for your installed version and prefer ToolException, or bind_tools with structured error propagation.

Once you've mastered error handling for single tools, learn how to orchestrate multiple tool calls and manage state across agent steps using LangGraph's StateGraph for more robust multi-step workflows.
