How to evaluate tool use accuracy in agents
Quick answer
Evaluate tool use accuracy in agents by logging tool invocation inputs and outputs, then comparing them against expected results using automated tests or human review. Use metrics like precision, recall, and success rate on tool calls to quantify accuracy.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable for authentication.
pip install openai>=1.0

Step by step
Implement a logging mechanism to capture each tool call's input and output during agent execution. Then, compare these logs against a ground truth dataset to calculate accuracy metrics.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example: Agent calls a tool (e.g., calculator) and logs usage
def call_tool_and_log(tool_input, log):
    # Simulate the tool call via an LLM or external API
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Calculate: {tool_input}"}]
    )
    tool_output = response.choices[0].message.content
    # Log input and output
    log.append({"input": tool_input, "output": tool_output})
    return tool_output
# Ground truth for evaluation
expected_results = {
    "2 + 2": "4",
    "5 * 3": "15",
    "10 / 2": "5",
}
# Run agent and log tool use
log = []
for expr in expected_results:
    call_tool_and_log(expr, log)

# Evaluate accuracy
correct = 0
for entry in log:
    expected = expected_results.get(entry["input"])
    if expected and expected.strip() == entry["output"].strip():
        correct += 1

accuracy = correct / len(expected_results)
print(f"Tool use accuracy: {accuracy:.2%}")

Output
Tool use accuracy: 100.00%
Common variations
You can extend the evaluation with asynchronous calls, streaming outputs, or different models such as claude-3-5-haiku-20241022. You can also integrate human review for ambiguous cases, or write automated unit tests against the tool APIs themselves.
import os
from anthropic import AsyncAnthropic

# Use the async client so the call can be awaited
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def call_tool_async(tool_input, log):
    message = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": f"Calculate: {tool_input}"}]
    )
    # message.content is a list of content blocks; take the first text block
    tool_output = message.content[0].text
    log.append({"input": tool_input, "output": tool_output})
    return tool_output
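A coroutine like this needs an event loop to run. As a minimal, self-contained sketch of the pattern, the stub below stands in for the Anthropic call (it just echoes the input so the snippet runs without an API key); the real `call_tool_async` from the example above would slot in the same way:

```python
import asyncio

async def call_tool_async(tool_input, log):
    # Stand-in for the real API call above; echoes the input
    # so this sketch runs without credentials.
    await asyncio.sleep(0)
    log.append({"input": tool_input, "output": tool_input})
    return tool_input

async def main():
    log = []
    # Fan out all tool calls concurrently, then return the shared log
    exprs = ["2 + 2", "5 * 3"]
    await asyncio.gather(*(call_tool_async(expr, log) for expr in exprs))
    return log

log = asyncio.run(main())
# log holds one entry per tool call
```

Because all coroutines run on one event loop, appending to a shared list here is safe; with real network calls, `asyncio.gather` lets the requests overlap instead of running sequentially.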
# Note: This requires an async event loop to run properly.

Troubleshooting
- If accuracy is low, verify your ground truth data matches expected tool outputs exactly.
- Check for formatting differences or extra whitespace in outputs that may cause false mismatches.
- Ensure your logging captures all tool calls without missing any.
- Use detailed error logs to identify failed tool invocations or API errors.
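For the whitespace and formatting mismatches mentioned above, a small normalization step before comparison can eliminate false mismatches. This is a sketch; the `normalize` helper is a hypothetical name, and which rules you apply depends on your tools' output formats:

```python
import re

def normalize(output: str) -> str:
    # Hypothetical helper: collapse whitespace, lowercase, and drop a
    # trailing period so cosmetic differences (e.g. "4." vs "4")
    # don't count as mismatches.
    out = output.strip().lower()
    out = re.sub(r"\s+", " ", out)
    return out.rstrip(".")

# "  4.\n" and "4" now compare equal
assert normalize("  4.\n") == normalize("4")
```

Apply the same normalization to both the logged output and the ground truth so the comparison stays symmetric.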
Key takeaways
- Log every tool invocation input and output for accurate evaluation.
- Compare logged outputs against a verified ground truth dataset.
- Use metrics like precision, recall, or simple accuracy to quantify tool use correctness.
- Incorporate human review for ambiguous or complex tool outputs.
- Automate evaluation with unit tests and continuous monitoring for agent reliability.
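The precision/recall framing from the takeaways can be sketched over sets of tool calls: precision penalizes spurious calls the agent made, recall penalizes expected calls it missed. The function name and the (tool_name, input) pair representation below are illustrative assumptions, not a standard API:

```python
def tool_call_metrics(predicted_calls, expected_calls):
    # Precision/recall over sets of (tool_name, input) pairs.
    # predicted_calls: calls the agent actually made
    # expected_calls: calls it should have made
    predicted = set(predicted_calls)
    expected = set(expected_calls)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Example: one spurious call ("search") and one missed call ("5 * 3")
p, r = tool_call_metrics(
    [("calculator", "2 + 2"), ("search", "weather")],
    [("calculator", "2 + 2"), ("calculator", "5 * 3")],
)
# p == 0.5, r == 0.5
```

Simple accuracy, as in the main example, is enough when the agent makes exactly one call per task; precision and recall matter once the agent can over- or under-call tools.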