How to evaluate tool use accuracy in agents
Quick answer
Evaluate tool use accuracy in agents by logging tool invocation inputs and outputs, then comparing them against expected results using automated tests or human review. Use metrics like precision, recall, and success rate on tool calls to quantify accuracy.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable for authentication.
pip install openai>=1.0

Step by step
Implement a logging mechanism to capture each tool call's input and output during agent execution. Then, compare these logs against a ground truth dataset to calculate accuracy metrics.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example: Agent calls a tool (e.g., calculator) and logs usage
def call_tool_and_log(tool_input, log):
    # Simulate the tool call via an LLM or external API
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Calculate: {tool_input}"}]
    )
    tool_output = response.choices[0].message.content
    # Log input and output
    log.append({"input": tool_input, "output": tool_output})
    return tool_output
# Ground truth for evaluation
expected_results = {
    "2 + 2": "4",
    "5 * 3": "15",
    "10 / 2": "5",
}
# Run agent and log tool use
log = []
for expr in expected_results:
    call_tool_and_log(expr, log)

# Evaluate accuracy
correct = 0
for entry in log:
    expected = expected_results.get(entry["input"])
    if expected and expected.strip() == entry["output"].strip():
        correct += 1

accuracy = correct / len(expected_results)
print(f"Tool use accuracy: {accuracy:.2%}")

Output
Tool use accuracy: 100.00%
Common variations
You can extend the evaluation with asynchronous calls, streaming outputs, or different models such as claude-3-5-haiku-20241022. You can also integrate human review for ambiguous cases, or write automated unit tests against the tool APIs themselves.
import os
from anthropic import AsyncAnthropic

# Use the async client so the call can be awaited
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def call_tool_async(tool_input, log):
    message = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=100,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": f"Calculate: {tool_input}"}]
    )
    # message.content is a list of content blocks; take the first text block
    tool_output = message.content[0].text
    log.append({"input": tool_input, "output": tool_output})
    return tool_output
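A coroutine like this needs an event loop to run. As a minimal, self-contained sketch of the pattern, the stub below stands in for the Anthropic call (it just echoes the input so the snippet runs without an API key); the real `call_tool_async` from the example above would slot in the same way:

```python
import asyncio

async def call_tool_async(tool_input, log):
    # Stand-in for the real API call above; echoes the input
    # so this sketch runs without credentials.
    await asyncio.sleep(0)
    log.append({"input": tool_input, "output": tool_input})
    return tool_input

async def main():
    log = []
    # Fan out all tool calls concurrently, then return the shared log
    exprs = ["2 + 2", "5 * 3"]
    await asyncio.gather(*(call_tool_async(expr, log) for expr in exprs))
    return log

log = asyncio.run(main())
# log holds one entry per tool call
```

Because all coroutines run on one event loop, appending to a shared list here is safe; with real network calls, `asyncio.gather` lets the requests overlap instead of running sequentially.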
# Note: This requires an async event loop to run properly.

Troubleshooting
- If accuracy is low, verify your ground truth data matches expected tool outputs exactly.
- Check for formatting differences or extra whitespace in outputs that may cause false mismatches.
- Ensure your logging captures all tool calls without missing any.
- Use detailed error logs to identify failed tool invocations or API errors.
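For the whitespace and formatting mismatches mentioned above, a small normalization step before comparison can eliminate false mismatches. This is a sketch; the `normalize` helper is a hypothetical name, and which rules you apply depends on your tools' output formats:

```python
import re

def normalize(output: str) -> str:
    # Hypothetical helper: collapse whitespace, lowercase, and drop a
    # trailing period so cosmetic differences (e.g. "4." vs "4")
    # don't count as mismatches.
    out = output.strip().lower()
    out = re.sub(r"\s+", " ", out)
    return out.rstrip(".")

# "  4.\n" and "4" now compare equal
assert normalize("  4.\n") == normalize("4")
```

Apply the same normalization to both the logged output and the ground truth so the comparison stays symmetric.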
Key takeaways
- Log every tool invocation input and output for accurate evaluation.
- Compare logged outputs against a verified ground truth dataset.
- Use metrics like precision, recall, or simple accuracy to quantify tool use correctness.
- Incorporate human review for ambiguous or complex tool outputs.
- Automate evaluation with unit tests and continuous monitoring for agent reliability.
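The precision/recall framing from the takeaways can be sketched over sets of tool calls: precision penalizes spurious calls the agent made, recall penalizes expected calls it missed. The function name and the (tool_name, input) pair representation below are illustrative assumptions, not a standard API:

```python
def tool_call_metrics(predicted_calls, expected_calls):
    # Precision/recall over sets of (tool_name, input) pairs.
    # predicted_calls: calls the agent actually made
    # expected_calls: calls it should have made
    predicted = set(predicted_calls)
    expected = set(expected_calls)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# Example: one spurious call ("search") and one missed call ("5 * 3")
p, r = tool_call_metrics(
    [("calculator", "2 + 2"), ("search", "weather")],
    [("calculator", "2 + 2"), ("calculator", "5 * 3")],
)
# p == 0.5, r == 0.5
```

Simple accuracy, as in the main example, is enough when the agent makes exactly one call per task; precision and recall matter once the agent can over- or under-call tools.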