How to · Intermediate · 3 min read

How to evaluate AI agent performance

Quick answer
Evaluate AI agent performance by measuring key metrics such as accuracy, latency, and task success rate. Use automated benchmarks, human evaluation, and user feedback to get a comprehensive assessment of your AI agent's effectiveness.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python SDK and set your API key as an environment variable to interact with AI models for evaluation.

bash
pip install "openai>=1.0"

Step by step

This example shows how to evaluate an AI agent's response accuracy and latency using the gpt-4o model from OpenAI. It sends a prompt, measures response time, and compares output to expected answers.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define test cases with expected outputs
test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Solve 5 + 7.", "expected": "12"},
    {"input": "Translate 'hello' to Spanish.", "expected": "hola"}
]

results = []

for case in test_cases:
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": case["input"]}]
    )
    latency = time.time() - start
    answer = response.choices[0].message.content.strip().lower()
    expected = case["expected"].lower()
    # Simple accuracy check: expected string in answer
    accuracy = expected in answer
    results.append({
        "input": case["input"],
        "answer": answer,
        "expected": expected,
        "accuracy": accuracy,
        "latency_sec": latency
    })

for r in results:
    print(f"Input: {r['input']}")
    print(f"Answer: {r['answer']}")
    print(f"Expected: {r['expected']}")
    print(f"Accuracy: {r['accuracy']}")
    print(f"Latency (sec): {r['latency_sec']:.3f}\n")
output
Input: What is the capital of France?
Answer: Paris
Expected: paris
Accuracy: True
Latency (sec): 0.850

Input: Solve 5 + 7.
Answer: 12
Expected: 12
Accuracy: True
Latency (sec): 0.830

Input: Translate 'hello' to Spanish.
Answer: hola
Expected: hola
Accuracy: True
Latency (sec): 0.840
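
The per-case results above can be rolled up into summary metrics. A minimal sketch, using sample data that mirrors the output shown (in practice, reuse the `results` list built by the evaluation loop):

```python
# Summarize per-case results into an overall accuracy rate and
# latency statistics. Sample data mirrors the output above.
results = [
    {"accuracy": True, "latency_sec": 0.850},
    {"accuracy": True, "latency_sec": 0.830},
    {"accuracy": True, "latency_sec": 0.840},
]

accuracy_rate = sum(r["accuracy"] for r in results) / len(results)
avg_latency = sum(r["latency_sec"] for r in results) / len(results)
max_latency = max(r["latency_sec"] for r in results)

print(f"Accuracy: {accuracy_rate:.0%}")      # Accuracy: 100%
print(f"Avg latency: {avg_latency:.3f}s")    # Avg latency: 0.840s
print(f"Max latency: {max_latency:.3f}s")    # Max latency: 0.850s
```

Tracking a worst-case figure such as max (or p95) latency alongside the average helps catch slow outliers that a mean alone would hide.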

Common variations

You can evaluate AI agents asynchronously or with streaming responses for real-time feedback. Also, try different models like claude-3-5-haiku-20241022 or gemini-2.0-flash to compare performance. Human evaluation and user surveys complement automated metrics.

python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = ["What is AI?", "Explain photosynthesis."]
    tasks = [evaluate_async(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for prompt, answer in zip(prompts, results):
        print(f"Prompt: {prompt}\nAnswer: {answer}\n")

asyncio.run(main())
output
Prompt: What is AI?
Answer: AI, or artificial intelligence, is the simulation of human intelligence processes by machines.

Prompt: Explain photosynthesis.
Answer: Photosynthesis is the process by which green plants convert sunlight into chemical energy.
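
For streaming responses, a useful latency metric is time to first token (TTFT) rather than total response time. A minimal sketch with a simulated stream (an assumption for illustration; with a real streaming call you would iterate the chunks returned by `client.chat.completions.create(..., stream=True)` and extract each chunk's delta content):

```python
import time

def measure_stream(chunks):
    """Consume a token iterator, recording time to first token (TTFT)
    and total elapsed time. Works with any iterable of text pieces."""
    start = time.time()
    first_token_at = None
    text = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.time() - start
        text.append(chunk)
    return {
        "text": "".join(text),
        "ttft_sec": first_token_at,
        "total_sec": time.time() - start,
    }

# Simulated stream standing in for a real API response.
def fake_stream():
    for token in ["Hello", ", ", "world", "!"]:
        time.sleep(0.01)  # pretend network delay per chunk
        yield token

stats = measure_stream(fake_stream())
print(stats["text"])  # Hello, world!
print(f"TTFT: {stats['ttft_sec']:.3f}s, total: {stats['total_sec']:.3f}s")
```

A low TTFT with a long total time usually means the model is responsive but verbose, which matters for perceived interactivity in chat UIs.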

Troubleshooting

  • If you see high latency, check your network connection or try a smaller model like gpt-4o-mini.
  • If accuracy is low, refine your prompt or lower the temperature for more deterministic output.
  • For inconsistent results, use fixed random seeds or temperature=0 to reduce randomness.
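
The consistency tip above can be quantified by calling the model several times with the same prompt and measuring how often the answers agree. A minimal sketch with illustrative hardcoded answers (in practice, collect these from N repeated calls, ideally with temperature=0 and a fixed seed):

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of runs that agree with the most common answer.
    1.0 means fully consistent output across repeated calls."""
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Illustrative answers from five repeated calls with the same prompt.
answers = ["Paris", "paris", "Paris", "Paris", "Lyon"]
print(consistency_rate(answers))  # 0.8
```

A rate well below 1.0 at temperature=0 suggests the prompt is ambiguous or the task genuinely has multiple acceptable answers, which is worth fixing in the test cases rather than the model settings.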

Key Takeaways

  • Measure AI agent performance using accuracy, latency, and task success rate metrics.
  • Combine automated benchmarks with human evaluation for comprehensive assessment.
  • Use SDK v1+ patterns and environment variables for secure, maintainable code.
  • Test multiple models and configurations to find the best fit for your use case.
  • Monitor latency and consistency to ensure reliable AI agent behavior.
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022, gemini-2.0-flash, gpt-4o-mini