
How to evaluate multi-step agent reasoning

Quick answer
To evaluate multi-step agent reasoning, use stepwise validation by checking intermediate outputs for correctness and coherence. Employ chain-of-thought prompting to expose reasoning steps, then apply automated metrics like accuracy, consistency, and logical coherence to assess performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).
```bash
pip install openai
```

Step by step

This example demonstrates how to prompt a multi-step reasoning agent with a chain-of-thought style prompt, then run a simple correctness check on its final answer.

```python
import os
import re

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Prompt with chain-of-thought to expose reasoning steps
messages = [
    {"role": "system", "content": "You are a helpful assistant that explains your reasoning step-by-step."},
    {"role": "user", "content": "If there are 3 apples and you buy 2 more, how many apples do you have? Show your reasoning."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)

answer = response.choices[0].message.content
print("Agent response with reasoning steps:\n", answer)

# Simple evaluation: check whether the final answer matches the expected value
expected_final_answer = "5"

# Extract the last number in the response as the final answer (naive approach)
numbers = re.findall(r"\d+", answer)
final_answer = numbers[-1] if numbers else None

if final_answer == expected_final_answer:
    print("Evaluation: PASS - Final answer is correct.")
else:
    print("Evaluation: FAIL - Final answer is incorrect.")
```
```text
Agent response with reasoning steps:
 First, you have 3 apples. Then you buy 2 more apples. So, 3 + 2 = 5 apples in total.
Evaluation: PASS - Final answer is correct.
```
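The check above only scores the final answer. To also validate intermediate steps, you can parse each arithmetic claim out of the reasoning text and recompute it. Here is a minimal sketch; the `validate_arithmetic_steps` helper is illustrative, not part of any SDK, and it only covers simple integer claims of the form "a + b = c".

```python
import re

def validate_arithmetic_steps(reasoning):
    """Recompute every 'a <op> b = c' claim found in the reasoning text.

    Returns a list of (claim, is_correct) pairs.
    """
    results = []
    for a, op, b, c in re.findall(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)", reasoning):
        a, b, c = int(a), int(b), int(c)
        computed = {"+": a + b, "-": a - b, "*": a * b}[op]
        results.append((f"{a} {op} {b} = {c}", computed == c))
    return results

reasoning = "First, you have 3 apples. Then you buy 2 more apples. So, 3 + 2 = 5 apples in total."
for step, ok in validate_arithmetic_steps(reasoning):
    print(f"{step}: {'OK' if ok else 'WRONG'}")  # 3 + 2 = 5: OK
```

This catches cases where the agent reaches the right final answer through faulty intermediate arithmetic, which a final-answer check alone would miss.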

Common variations

You can enhance evaluation by:

  • Using max_tokens and temperature parameters to control output length and creativity.
  • Applying automated metrics like BLEU or ROUGE on intermediate steps for consistency.
  • Using other models like claude-3-5-haiku-20241022 for comparative evaluation.
  • Implementing asynchronous calls or streaming for real-time step validation.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

system_prompt = "You are a helpful assistant that explains your reasoning step-by-step."
user_prompt = "If there are 3 apples and you buy 2 more, how many apples do you have? Show your reasoning."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=200,
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)

# response.content is a list of content blocks; the text lives in .text
print("Claude response with reasoning steps:\n", response.content[0].text)
```
```text
Claude response with reasoning steps:
 First, you start with 3 apples. Then you buy 2 more apples. Adding them together, 3 + 2 equals 5 apples in total.
```

Troubleshooting

  • If the agent's reasoning skips steps or is incomplete, increase max_tokens to allow longer responses.
  • If the final answer is incorrect, verify prompt clarity and consider adding explicit instructions to explain each step.
  • For noisy or inconsistent intermediate steps, use multiple runs and aggregate results to improve reliability.
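For the multiple-runs suggestion above, a simple majority vote over sampled final answers is a common aggregation strategy (often called self-consistency). A minimal sketch, assuming you have already collected final answers from repeated API calls; the hard-coded `answers` list stands in for those, and `majority_vote` is an illustrative helper:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its agreement rate
    (a rough confidence score), ignoring runs that produced no answer."""
    filtered = [a for a in answers if a is not None]
    if not filtered:
        return None, 0.0
    answer, count = Counter(filtered).most_common(1)[0]
    return answer, count / len(filtered)

# In practice these would come from repeated completion calls.
answers = ["5", "5", "4", "5", None]
best, agreement = majority_vote(answers)
print(f"Majority answer: {best} (agreement {agreement:.0%})")  # Majority answer: 5 (agreement 75%)
```

A low agreement rate is itself a useful signal: it flags prompts where the agent's reasoning is unstable and worth closer inspection.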

Key Takeaways

  • Use chain-of-thought prompting to expose multi-step reasoning explicitly.
  • Evaluate intermediate steps for logical coherence and final answer accuracy.
  • Leverage automated metrics and multiple model comparisons for robust evaluation.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022