
How to evaluate multi-step agent reasoning

Quick answer
To evaluate multi-step agent reasoning, use stepwise validation by checking intermediate outputs for correctness and coherence. Employ chain-of-thought prompting to expose reasoning steps, then apply automated metrics like accuracy, consistency, and logical coherence to assess performance.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).
```bash
pip install openai
```

Step by step

This example demonstrates how to prompt a multi-step reasoning agent with a chain-of-thought style prompt, then run a simple correctness check on its final answer.

```python
import os
import re

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Prompt with chain-of-thought to expose reasoning steps
messages = [
    {"role": "system", "content": "You are a helpful assistant that explains your reasoning step-by-step."},
    {"role": "user", "content": "If there are 3 apples and you buy 2 more, how many apples do you have? Show your reasoning."},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)

answer = response.choices[0].message.content
print("Agent response with reasoning steps:\n", answer)

# Simple evaluation: check whether the final answer matches the expected value
expected_final_answer = "5"

# Extract the last number in the response as the final answer (naive approach)
numbers = re.findall(r"\d+", answer)
final_answer = numbers[-1] if numbers else None

if final_answer == expected_final_answer:
    print("Evaluation: PASS - Final answer is correct.")
else:
    print("Evaluation: FAIL - Final answer is incorrect.")
```
```text
Agent response with reasoning steps:
 First, you have 3 apples. Then you buy 2 more apples. So, 3 + 2 = 5 apples in total.
Evaluation: PASS - Final answer is correct.
```
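The check above only scores the final answer. To also validate intermediate steps, you can parse each arithmetic claim out of the reasoning text and recompute it. Here is a minimal sketch; the `validate_arithmetic_steps` helper is illustrative, not part of any SDK, and it only covers simple integer claims of the form "a + b = c".

```python
import re

def validate_arithmetic_steps(reasoning):
    """Recompute every 'a <op> b = c' claim found in the reasoning text.

    Returns a list of (claim, is_correct) pairs.
    """
    results = []
    for a, op, b, c in re.findall(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)", reasoning):
        a, b, c = int(a), int(b), int(c)
        computed = {"+": a + b, "-": a - b, "*": a * b}[op]
        results.append((f"{a} {op} {b} = {c}", computed == c))
    return results

reasoning = "First, you have 3 apples. Then you buy 2 more apples. So, 3 + 2 = 5 apples in total."
for step, ok in validate_arithmetic_steps(reasoning):
    print(f"{step}: {'OK' if ok else 'WRONG'}")  # 3 + 2 = 5: OK
```

This catches cases where the agent reaches the right final answer through faulty intermediate arithmetic, which a final-answer check alone would miss.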

Common variations

You can enhance evaluation by:

  • Using max_tokens and temperature parameters to control output length and creativity.
  • Applying automated metrics like BLEU or ROUGE on intermediate steps for consistency.
  • Using other models like claude-3-5-haiku-20241022 for comparative evaluation.
  • Implementing asynchronous calls or streaming for real-time step validation.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

system_prompt = "You are a helpful assistant that explains your reasoning step-by-step."
user_prompt = "If there are 3 apples and you buy 2 more, how many apples do you have? Show your reasoning."

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=200,
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)

# response.content is a list of content blocks; the text lives in .text
print("Claude response with reasoning steps:\n", response.content[0].text)
```
```text
Claude response with reasoning steps:
 First, you start with 3 apples. Then you buy 2 more apples. Adding them together, 3 + 2 equals 5 apples in total.
```

Troubleshooting

  • If the agent's reasoning skips steps or is incomplete, increase max_tokens to allow longer responses.
  • If the final answer is incorrect, verify prompt clarity and consider adding explicit instructions to explain each step.
  • For noisy or inconsistent intermediate steps, use multiple runs and aggregate results to improve reliability.
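For the multiple-runs suggestion above, a simple majority vote over sampled final answers is a common aggregation strategy (often called self-consistency). A minimal sketch, assuming you have already collected final answers from repeated API calls; the hard-coded `answers` list stands in for those, and `majority_vote` is an illustrative helper:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its agreement rate
    (a rough confidence score), ignoring runs that produced no answer."""
    filtered = [a for a in answers if a is not None]
    if not filtered:
        return None, 0.0
    answer, count = Counter(filtered).most_common(1)[0]
    return answer, count / len(filtered)

# In practice these would come from repeated completion calls.
answers = ["5", "5", "4", "5", None]
best, agreement = majority_vote(answers)
print(f"Majority answer: {best} (agreement {agreement:.0%})")  # Majority answer: 5 (agreement 75%)
```

A low agreement rate is itself a useful signal: it flags prompts where the agent's reasoning is unstable and worth closer inspection.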

Key Takeaways

  • Use chain-of-thought prompting to expose multi-step reasoning explicitly.
  • Evaluate intermediate steps for logical coherence and final answer accuracy.
  • Leverage automated metrics and multiple model comparisons for robust evaluation.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022