How to · Intermediate · 4 min read

How to test LLM outputs systematically

Quick answer
To test LLM outputs systematically, send prompts and capture responses programmatically with an SDK such as OpenAI's or Anthropic's, then apply evaluation metrics (exact-match accuracy, BLEU, or custom heuristics) to measure output quality and consistency across test cases.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

This example demonstrates how to send multiple prompts to gpt-4o, collect outputs, and perform a simple keyword match evaluation to test output correctness.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define test prompts and expected keywords
prompts = [
    "Translate 'Hello' to French.",
    "Summarize the plot of 'The Matrix'.",
    "What is the capital of Japan?"
]
expected_keywords = ["Bonjour", "Neo", "Tokyo"]

results = []

for prompt, keyword in zip(prompts, expected_keywords):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.choices[0].message.content
    passed = keyword.lower() in output.lower()
    results.append({"prompt": prompt, "output": output, "passed": passed})

for r in results:
    print(f"Prompt: {r['prompt']}")
    print(f"Output: {r['output']}")
    print(f"Test passed: {r['passed']}\n")
output
Prompt: Translate 'Hello' to French.
Output: Bonjour
Test passed: True

Prompt: Summarize the plot of 'The Matrix'.
Output: The Matrix follows Neo, who discovers the reality is a simulation.
Test passed: True

Prompt: What is the capital of Japan?
Output: The capital of Japan is Tokyo.
Test passed: True
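The keyword heuristic above also slots naturally into a test runner for CI. The sketch below expresses it as a parametrized pytest suite; `FakeClient`, its canned responses, and `keyword_passed` are illustrative names (not part of any SDK), and for live runs you would swap the fake for a thin wrapper around the real client:

```python
import pytest

CASES = [
    ("Translate 'Hello' to French.", "Bonjour"),
    ("What is the capital of Japan?", "Tokyo"),
]

def keyword_passed(output, keyword):
    """Same heuristic as above: case-insensitive keyword containment."""
    return keyword.lower() in output.lower()

class FakeClient:
    """Offline stand-in for the API client so the harness itself is testable."""
    _canned = {
        "Translate 'Hello' to French.": "Bonjour",
        "What is the capital of Japan?": "The capital of Japan is Tokyo.",
    }

    def complete(self, prompt):
        return self._canned[prompt]

@pytest.mark.parametrize("prompt,keyword", CASES)
def test_llm_output(prompt, keyword):
    client = FakeClient()  # swap in a real API wrapper for live runs
    assert keyword_passed(client.complete(prompt), keyword)
```

Running this with `pytest` gives one pass/fail result per case, which most CI systems can report directly.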

Common variations

You can extend testing with asynchronous calls for faster batch processing, streaming outputs when you want to validate text as it arrives, or switching to a different model such as claude-3-5-sonnet-20241022 for comparative evaluation.

python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def test_prompt(prompt, keyword):
    # The async client uses the same create() method, awaited
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.choices[0].message.content
    return {"prompt": prompt, "output": output, "passed": keyword.lower() in output.lower()}

async def main():
    prompts = ["Translate 'Hello' to French.", "What is the capital of Japan?"]
    keywords = ["Bonjour", "Tokyo"]
    tasks = [test_prompt(p, k) for p, k in zip(prompts, keywords)]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(f"Prompt: {r['prompt']}")
        print(f"Output: {r['output']}")
        print(f"Test passed: {r['passed']}\n")

asyncio.run(main())
output
Prompt: Translate 'Hello' to French.
Output: Bonjour
Test passed: True

Prompt: What is the capital of Japan?
Output: The capital of Japan is Tokyo.
Test passed: True
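Streaming collects the output as token deltas arrive, which lets you assemble and validate text incrementally instead of waiting for the full response. A minimal sketch, assuming the openai v1 streaming interface; `assemble` and `stream_and_check` are illustrative helpers, not SDK functions:

```python
def assemble(pieces):
    """Join streamed text deltas (skipping None/empty) into the full output."""
    return "".join(p for p in pieces if p)

def stream_and_check(client, prompt, keyword, model="gpt-4o"):
    """Stream a chat completion and apply the keyword heuristic to the final text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    # Each chunk carries a text delta; chunks with no choices are skipped.
    pieces = [chunk.choices[0].delta.content for chunk in stream if chunk.choices]
    output = assemble(pieces)
    return {"prompt": prompt, "output": output,
            "passed": keyword.lower() in output.lower()}

# Usage (requires OPENAI_API_KEY and the openai package):
#   client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
#   print(stream_and_check(client, "What is the capital of Japan?", "Tokyo"))
```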

Troubleshooting

  • If you get authentication errors, verify your API key is set correctly in os.environ["OPENAI_API_KEY"].
  • If outputs are inconsistent, lower the temperature (e.g., temperature=0) to reduce randomness, or inspect logprobs to debug token probabilities.
  • For rate limits, implement retries with exponential backoff.
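The last point can be sketched as a small stdlib-only retry wrapper; the delay schedule, jitter factor, and helper names below are illustrative defaults rather than values from any SDK:

```python
import random
import time

def backoff_delays(retries, base=1.0, cap=30.0):
    """Exponential delays with jitter: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** i)) * random.uniform(0.5, 1.0) for i in range(retries)]

def with_retries(fn, retries=5, base=1.0, retry_on=(Exception,)):
    """Call fn(); on a retryable error, sleep per the backoff schedule and try again."""
    for delay in backoff_delays(retries, base=base):
        try:
            return fn()
        except retry_on:
            time.sleep(delay)
    return fn()  # final attempt: let any remaining error propagate

# Usage with the OpenAI client (openai.RateLimitError is raised on HTTP 429):
#   result = with_retries(
#       lambda: client.chat.completions.create(model="gpt-4o", messages=msgs),
#       retry_on=(openai.RateLimitError,),
#   )
```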

Key Takeaways

  • Automate LLM output testing by scripting prompt submissions and capturing responses via SDKs.
  • Use simple heuristics or formal metrics to evaluate output correctness and consistency.
  • Leverage async calls and streaming for efficient large-scale testing.
  • Adjust model parameters like temperature to control output variability during tests.
  • Handle API errors and rate limits gracefully to maintain robust testing pipelines.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022