Tool errors: handling failure gracefully
Why this matters
In production, external tools fail: APIs time out, services go down, invalid arguments slip through. Without error handling, the first failure stops your agent cold. Graceful failure lets the agent retry, switch to a fallback tool, or ask for clarification instead of returning a 500 error to your user.
Explanation
Tool errors are exceptions raised while a tool executes: bad API responses, network timeouts, validation failures, or invalid arguments. In LangChain, an unhandled tool error propagates up and breaks your agentic loop. Graceful handling means intercepting the error, logging it, and then retrying the tool, returning a fallback response, or asking the LLM to adjust its next move. LangChain's ToolException exists for this purpose: you raise it inside a tool to signal a failure the model should hear about. The exception is not swallowed automatically, though. It is converted into a tool result only when the tool is configured with handle_tool_error (True, a fixed string, or a callable), when the agent runtime you use is set up to handle tool errors, or when you catch it yourself, as the code in this lesson does. Once it is converted, the agent sees the error as data (not a crash) and can reason about what went wrong. This keeps the loop alive and lets the LLM self-correct: call a different tool, refine its arguments, or deliberately report the failure to the user.
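To make "the error as data" concrete, here is a minimal, framework-agnostic sketch using only the standard library. ToolError, lookup_user, run_tool_call, and the shape of the returned message dict are all illustrative assumptions, not LangChain's API; the point is only that a failed call produces a message the model can read on its next turn.

```python
from typing import Any, Callable


class ToolError(Exception):
    """Stand-in for a framework's tool exception (hypothetical, stdlib only)."""


def lookup_user(user_id: int) -> dict:
    """Toy tool: fails for ids below 1."""
    if user_id < 1:
        raise ToolError(f"Invalid user_id: {user_id}. Must be >= 1.")
    return {"user_id": user_id, "name": f"User_{user_id}"}


# Registry of callable tools, keyed by the name the model would request.
TOOLS: dict[str, Callable[..., Any]] = {"lookup_user": lookup_user}


def run_tool_call(call: dict) -> dict:
    """Execute one tool call and always return a message instead of raising ToolError."""
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"role": "tool", "id": call["id"], "status": "error",
                "content": f"Unknown tool: {call['name']}"}
    try:
        result = tool(**call["args"])
    except ToolError as exc:
        # The failure becomes data that can be appended to the conversation.
        return {"role": "tool", "id": call["id"], "status": "error",
                "content": str(exc)}
    return {"role": "tool", "id": call["id"], "status": "ok", "content": result}


ok = run_tool_call({"name": "lookup_user", "args": {"user_id": 42}, "id": "call_1"})
bad = run_tool_call({"name": "lookup_user", "args": {"user_id": -5}, "id": "call_2"})
print(ok["status"], bad["status"], bad["content"])
```

Only ToolError is caught here on purpose: unexpected exceptions still surface, so genuine bugs are not hidden behind a polite message.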
Analogy
Think of a waiter at a restaurant. If a customer orders a dish and the kitchen says 'we're out of that ingredient,' the waiter doesn't panic and walk out. Instead, they tell the customer the problem and ask 'Would you like the salmon instead?' The agent is like that waiter: when a tool fails, it doesn't crash; it receives the error message and decides the next best action.
Code
# Requires the langchain-openai package and an OPENAI_API_KEY in the environment.
import json

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool, ToolException
from langchain_core.prompts import ChatPromptTemplate


@tool
def fetch_user_data(user_id: int) -> dict:
    """Fetch user information by ID. Raises ToolException on failure."""
    if user_id < 1:
        raise ToolException(f"Invalid user_id: {user_id}. Must be >= 1.")
    if user_id == 999:
        # Simulated outage so the failure path can be exercised.
        raise ToolException("Service unavailable: user database is down.")
    return {"user_id": user_id, "name": f"User_{user_id}", "email": f"user{user_id}@example.com"}


@tool
def get_fallback_user() -> dict:
    """Return a default user when primary lookup fails."""
    return {"user_id": 0, "name": "Guest", "email": "guest@example.com"}


def handle_tool_error(error: ToolException) -> str:
    """Convert tool error to a message the LLM can read and act on."""
    return (
        f"Tool failed with error: {error} "
        "Consider using the get_fallback_user tool or asking the user for valid input."
    )


tools = [fetch_user_data, get_fallback_user]
tools_by_name = {t.name: t for t in tools}

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm_with_tools = llm.bind_tools(tools)
prompt = ChatPromptTemplate.from_messages([
    ("user", "Fetch user data for user_id {user_id}. If that fails, use get_fallback_user and return the result.")
])
chain = prompt | llm_with_tools


def run_test(title: str, user_id: int) -> None:
    """Ask the model for tool calls, execute each one, and report errors instead of crashing."""
    print(f"=== {title} ===")
    try:
        result = chain.invoke({"user_id": user_id})
        for tool_call in result.tool_calls:
            # Each tool call is a dict, so use key access rather than attributes.
            print(f"Tool call: {tool_call['name']}({tool_call['args']})")
            tool_obj = tools_by_name[tool_call["name"]]
            try:
                output = tool_obj.invoke(tool_call["args"])
                print(f"Tool output: {json.dumps(output, indent=2)}")
            except ToolException as e:
                print(f"Tool error caught: {handle_tool_error(e)}")
    except Exception as e:
        print(f"Chain error: {e}")


cases = [
    ("Test 1: Valid user_id", 42),
    ("Test 2: Invalid user_id (negative)", -5),
    ("Test 3: Service failure (user_id=999)", 999),
]
for title, user_id in cases:
    run_test(title, user_id)
    print()

Output
=== Test 1: Valid user_id ===
Tool call: fetch_user_data({'user_id': 42})
Tool output: {
  "user_id": 42,
  "name": "User_42",
  "email": "user42@example.com"
}

=== Test 2: Invalid user_id (negative) ===
Tool call: fetch_user_data({'user_id': -5})
Tool error caught: Tool failed with error: Invalid user_id: -5. Must be >= 1. Consider using the get_fallback_user tool or asking the user for valid input.

=== Test 3: Service failure (user_id=999) ===
Tool call: fetch_user_data({'user_id': 999})
Tool error caught: Tool failed with error: Service unavailable: user database is down. Consider using the get_fallback_user tool or asking the user for valid input.

What just happened?
The code defined two tools: fetch_user_data, which raises ToolException on invalid or unavailable data, and get_fallback_user as a recovery option. For each test case, we invoked the chain with a different user_id, then executed the tool calls the model requested. When a tool raised a ToolException, we caught it with a try/except block and converted it to a human-readable message. Valid inputs returned data; invalid or service-failure inputs triggered the error handler. Note that this demo only prints the message. In a full agent loop you would append it to the conversation as the tool's result, so the LLM can read it and reason about recovery.
Common gotcha
Developers often raise generic Python exceptions (ValueError, RuntimeError) inside tools and expect the agent to handle them. Unless your runtime is explicitly configured to catch every exception, these crash the agent loop. Raise ToolException explicitly, or wrap other exceptions and re-raise them as ToolException, and make sure something actually handles it: set handle_tool_error on the tool or catch it in your loop. A second trap is forgetting that ToolException messages are strings the LLM reads. They must be clear and actionable, not cryptic error codes.
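One way to apply the wrap-and-re-raise advice is a small decorator. This is a sketch under assumptions: ToolError is a local stand-in for the framework's exception class, and as_tool_error and parse_user_id are hypothetical names. Which exception types count as "expected" is a design choice you would adapt to your own tools.

```python
import functools


class ToolError(Exception):
    """Stand-in for the framework's tool exception (hypothetical)."""


def as_tool_error(func):
    """Re-raise expected low-level failures as ToolError with a readable message."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (ValueError, TimeoutError) as exc:
            # Chain the original exception so the traceback is kept for debugging.
            raise ToolError(
                f"{func.__name__} failed: {exc}. Check the arguments and try again."
            ) from exc
    return wrapper


@as_tool_error
def parse_user_id(raw: str) -> int:
    """Toy tool body that raises a plain ValueError on bad input."""
    return int(raw)


try:
    parse_user_id("abc")
except ToolError as err:
    message = str(err)
    print(message)
```

Listing the exception types explicitly keeps programming errors such as TypeError or KeyError visible instead of turning every bug into a message for the model.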
Error recovery
AttributeError: 'tool_call' object has no attribute 'name'
ToolException not caught, agent crashes
LLM sees the error but doesn't retry
Tool error message too long, causes token bloat
Experienced dev note
In production, the difference between raising ToolException and a regular exception is the difference between a graceful degradation and a 500 error. But here's what most devs miss: even with ToolException, the LLM needs to know *how* to recover. If you only raise the error without giving it access to a fallback tool or retry logic, the LLM will just apologize to the user. Always pair error handling with alternative tools or explicit recovery instructions in the prompt. Also, log all ToolExceptions: not in the LLM message, but to your observability backend. Errors are data; they tell you which services are flaky.
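The split between what operators see and what the model sees can be sketched with the standard logging module. The logger name, the report_tool_error helper, and the 200-character default are assumptions chosen for illustration; a real setup would route the logger to whatever observability backend you use.

```python
import logging

logger = logging.getLogger("tool_errors")


def report_tool_error(tool_name: str, exc: Exception, max_chars: int = 200) -> str:
    """Log the full error for operators; return a short message for the model."""
    # Full detail goes to the logs, not into the prompt.
    logger.error("tool=%s error=%r", tool_name, exc)
    text = f"{tool_name} failed: {exc}"
    if len(text) > max_chars:
        # Truncate so a huge payload does not inflate the context window.
        text = text[: max_chars - 3] + "..."
    return text


short = report_tool_error("fetch_user_data", RuntimeError("x" * 500), max_chars=60)
print(len(short))
```

The truncation also addresses the token bloat symptom: a stack trace or raw HTTP body can run to thousands of tokens, while the model usually needs one sentence.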
Check your understanding
If a tool raises a ToolException with the message 'API rate limit exceeded', why does feeding that message back to the LLM not automatically solve the problem? What would the LLM need to actually recover?
Answer hint
A correct answer recognizes that the LLM reads the error as text data, not as a trigger for built-in retry logic. Recovery requires either (1) an explicit retry tool that the LLM can call, (2) a fallback tool, or (3) a human-in-the-loop prompt. The error message alone doesn't fix the rate limit; the LLM needs *options* to act on.
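As an illustration of giving the agent options, the sketch below retries a rate-limited call with exponential backoff and then falls back. RateLimitError, call_with_retry, flaky, and the delay values are hypothetical; the sleep function is injectable only so the example can run without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error a tool might raise when an API throttles it."""


def call_with_retry(primary: Callable[[], T], fallback: Callable[[], T],
                    attempts: int = 3, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Try primary up to `attempts` times with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return primary()
        except RateLimitError:
            # Wait longer after each failure, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()


calls = {"n": 0}


def flaky() -> str:
    """Fails twice with a rate limit, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("API rate limit exceeded")
    return "live data"


delays: list[float] = []
result = call_with_retry(flaky, lambda: "cached data", sleep=delays.append)
print(result, delays)
```

Whether this logic lives inside the tool, in the agent loop, or behind a separate tool the LLM can call is a design decision; the error text alone does none of it.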