How to evaluate AI agent performance
Use prompt engineering and benchmark datasets to systematically assess agent capabilities and robustness.

Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to interact with AI agents programmatically.
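On macOS or Linux, setting the key for the current shell session looks like this (the key value is a placeholder; on Windows, use `setx` instead):

```shell
# Make the key visible to any script started from this shell
export OPENAI_API_KEY="sk-your-key-here"
```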
pip install openai

Step by step
Use the OpenAI API to send tasks to your AI agent and evaluate its responses against expected outputs. Measure metrics like success rate, response time, and qualitative scores.
import os
import time

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define test cases with input prompts and expected outputs
test_cases = [
    {"input": "Translate 'Hello' to French.", "expected": "Bonjour"},
    {"input": "What is 5 plus 7?", "expected": "12"},
    {"input": "Summarize the plot of 'The Matrix'.", "expected": None},  # subjective
]

success_count = 0
latencies = []

for case in test_cases:
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": case["input"]}]
    )
    latency = time.time() - start
    latencies.append(latency)
    answer = response.choices[0].message.content.strip()
    # Simple exact match for objective tasks
    if case["expected"]:
        if answer.lower() == case["expected"].lower():
            success_count += 1
    else:
        # For subjective tasks, manual or heuristic evaluation is needed
        print(f"Input: {case['input']}\nAgent answer: {answer}\n")

print(f"Success rate: {success_count}/{len(test_cases)}")
print(f"Average latency: {sum(latencies)/len(latencies):.2f} seconds")

Input: Summarize the plot of 'The Matrix'.
Agent answer: The Matrix is a sci-fi film where a hacker discovers reality is a simulation controlled by machines.

Success rate: 2/3
Average latency: 1.23 seconds
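The loop above skips scoring subjective tasks entirely. Between brittle exact match and full human review sits a cheap middle ground: substring or string-similarity matching. A minimal sketch using the standard library's difflib (the 0.8 threshold is an arbitrary assumption, not a tuned value):

```python
from difflib import SequenceMatcher

def heuristic_match(answer: str, expected: str, threshold: float = 0.8) -> bool:
    """Pass if the expected string appears in the answer, or if the two
    strings are similar enough under a simple character-level ratio."""
    answer_l = answer.lower().strip()
    expected_l = expected.lower().strip()
    if expected_l in answer_l:
        return True
    return SequenceMatcher(None, answer_l, expected_l).ratio() >= threshold

print(heuristic_match("The answer is 12.", "12"))  # substring hit
print(heuristic_match("Bonjour!", "Bonjour"))      # near-identical strings
print(heuristic_match("Hola", "Bonjour"))          # clearly different
```

This would let "What is 5 plus 7?" count as a success even when the model replies with a full sentence such as "5 plus 7 equals 12."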
Common variations
You can evaluate asynchronously, use streaming responses to measure latency more precisely, or swap in other models (for example, claude-3-5-sonnet-20241022 via Anthropic's API) for coding- or reasoning-heavy tasks. Human evaluation can complement automated metrics on subjective tasks.
import asyncio
import os

from openai import AsyncOpenAI

# The v1 SDK uses a dedicated async client; there is no acreate method
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_async(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Explain quantum computing.", "Write a Python function to reverse a string."]
    results = await asyncio.gather(*(evaluate_async(p) for p in prompts))
    for prompt, answer in zip(prompts, results):
        print(f"Prompt: {prompt}\nAnswer: {answer}\n")

asyncio.run(main())

Prompt: Explain quantum computing.
Answer: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling complex computations.

Prompt: Write a Python function to reverse a string.
Answer: def reverse_string(s):
    return s[::-1]
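For streaming, the usual latency metric is time to first token rather than total response time. A minimal sketch, assuming the OpenAI v1 SDK's `stream=True` option (the model name and prompt are placeholders; the helper itself works on any iterator of OpenAI-style chunks):

```python
import os
import time

def time_to_first_token(chunks):
    """Consume a stream of chat-completion chunks, returning the seconds
    until the first content token and the fully assembled text."""
    start = time.time()
    ttft = None
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:
            if ttft is None:
                ttft = time.time() - start
            parts.append(delta)
    return ttft, "".join(parts)

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain quantum computing."}],
        stream=True,
    )
    ttft, text = time_to_first_token(stream)
    print(f"Time to first token: {ttft:.2f}s ({len(text)} chars total)")
```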
Troubleshooting
If you see inconsistent results, ensure your prompts are clear and unambiguous. For latency spikes, check your network and API usage limits. Use logging to track agent responses and errors for debugging.
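One way to add that logging is a small wrapper around whatever callable issues the API request; the retry count and backoff values below are illustrative assumptions, and the wrapper works with any function, not just the OpenAI client:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("agent-eval")

def call_with_logging(fn, prompt, retries=3, backoff=2.0):
    """Call fn(prompt), logging the response or the error; retry failed
    calls with a simple linear backoff before giving up."""
    for attempt in range(1, retries + 1):
        try:
            answer = fn(prompt)
            log.info("prompt=%r answer=%r", prompt, answer[:80])
            return answer
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise
            time.sleep(backoff * attempt)

# Usage with any callable, e.g. a lambda wrapping the real API call:
echo = lambda p: f"echo: {p}"
print(call_with_logging(echo, "ping"))
```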
Key Takeaways
- Measure AI agent performance with objective metrics like success rate and latency.
- Use both automated tests and human evaluation for subjective tasks.
- Leverage asynchronous calls and streaming for efficient performance measurement.