How-to · Beginner · 3 min read

How to test LLM output quality

Quick answer
Test LLM output quality by combining automated metrics like BLEU, ROUGE, or perplexity with human evaluation for relevance and coherence. Use Python scripts to generate outputs via OpenAI or Anthropic SDKs and compare them against reference answers or criteria.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version spec so the shell doesn't treat > as redirection)
  • pip install nltk rouge-score

Setup

Install required Python packages and set your API key as an environment variable for secure access.

bash
pip install openai nltk rouge-score

# Set environment variable in your shell
export OPENAI_API_KEY="your_api_key_here"
output
Requirement already satisfied: openai ...
Requirement already satisfied: nltk ...
Requirement already satisfied: rouge-score ...

Step by step

This example uses the OpenAI SDK to generate LLM responses and evaluates output quality with nltk BLEU and rouge-score metrics against reference texts.

python
import os
from openai import OpenAI
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Reference and prompt
reference = "The quick brown fox jumps over the lazy dog."
prompt = "Write a sentence about a quick fox and a lazy dog."

# Generate LLM output
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
output = response.choices[0].message.content
print("LLM output:", output)

# Automated evaluation
reference_tokens = [reference.split()]
candidate_tokens = output.split()
bleu_score = sentence_bleu(reference_tokens, candidate_tokens)

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = scorer.score(reference, output)

print(f"BLEU score: {bleu_score:.4f}")
print(f"ROUGE-1 F1 score: {rouge_scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-L F1 score: {rouge_scores['rougeL'].fmeasure:.4f}")
output
LLM output: The quick brown fox jumps over a lazy dog.
BLEU score: 0.7598
ROUGE-1 F1 score: 0.8889
ROUGE-L F1 score: 0.8571
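Before running n-gram metrics, a cheap deterministic check can catch obvious failures. The sketch below uses a hypothetical `meets_criteria` helper that verifies required keywords appear in the output; adapt the keyword list to your prompt.

```python
# Minimal keyword-criteria check (hypothetical helper): verify the output
# mentions required terms before running heavier BLEU/ROUGE scoring.
def meets_criteria(text, required=("fox", "dog")):
    text = text.lower()
    return all(word in text for word in required)

print(meets_criteria("The quick brown fox jumps over a lazy dog."))  # True
print(meets_criteria("A cat sleeps in the sun."))                    # False
```

A check like this is fast and free to run on every generation, so it works well as a first gate before paid API-based or human evaluation.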

Common variations

You can test output quality asynchronously or with other models, such as claude-3-5-haiku-20241022 via the Anthropic SDK (pip install anthropic, plus an Anthropic API key). Streaming output can also be evaluated by accumulating tokens into the full response before scoring. Human evaluation remains essential for subjective quality aspects such as relevance and coherence.

python
import asyncio
import os

from anthropic import AsyncAnthropic

async def async_test():
    # Use the async client so the request can be awaited
    client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=256,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "Describe a quick fox and lazy dog."}]
    )
    # response.content is a list of content blocks; print the first text block
    print("Claude output:", response.content[0].text)

asyncio.run(async_test())
output
Claude output: A quick brown fox swiftly jumps over a lazy dog resting in the sun.
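To score a streamed response, accumulate the delta chunks into one string and evaluate only after the stream ends. This sketch uses a simulated chunk list in place of the SDK's streaming iterator; `accumulate_stream` is a hypothetical helper, not part of either SDK.

```python
def accumulate_stream(chunks):
    # Join streamed text deltas into the full response; streaming APIs
    # can yield empty or None deltas, so skip falsy chunks.
    return "".join(chunk for chunk in chunks if chunk)

# Simulated stream standing in for the SDK's chunk iterator
simulated_stream = ["The quick ", "brown fox ", None, "jumps over ", "a lazy dog."]
full_output = accumulate_stream(simulated_stream)
print(full_output)  # score full_output with BLEU/ROUGE once the stream ends
```

The key point is that n-gram metrics need the complete text, so scoring happens once per response, not per chunk.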

Troubleshooting

  • If BLEU or ROUGE scores are unexpectedly low, verify that tokenization matches between reference and output; for short outputs, pass an nltk SmoothingFunction to sentence_bleu to avoid zero n-gram counts.
  • Ensure your API key is set in the environment (os.environ) to avoid authentication errors.
  • For truncated or inconsistent outputs, raise max_tokens, lower the temperature, or tighten the prompt.
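The tokenization point can be addressed with a quick normalization pass: lowercase and strip punctuation so reference and candidate tokenize identically before scoring. `normalize_tokens` below is a hypothetical helper, not part of nltk or rouge-score.

```python
import string

def normalize_tokens(text):
    # Lowercase and drop punctuation so reference and candidate
    # are tokenized the same way before metric computation
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

ref = normalize_tokens("The quick brown fox jumps over the lazy dog.")
cand = normalize_tokens("The quick brown fox jumps over a lazy dog")
print(ref[-1] == cand[-1])  # True: "dog" matches once the "." is stripped
```

Without normalization, naive str.split() leaves "dog." and "dog" as different tokens, silently deflating n-gram overlap.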

Key Takeaways

  • Combine automated metrics like BLEU and ROUGE with human review for comprehensive LLM output quality testing.
  • Use official SDKs like OpenAI or Anthropic with environment-secured API keys for reliable generation.
  • Tokenization consistency is critical for accurate metric evaluation.
  • Test with multiple models and prompt variations to benchmark output quality effectively.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022