How to test LLM output quality
Quick answer
Test LLM output quality by combining automated metrics like
BLEU, ROUGE, or perplexity with human evaluation for relevance and coherence. Use Python scripts to generate outputs via the OpenAI or Anthropic SDKs and compare them against reference answers or criteria.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install nltk rouge-score
Setup
Install required Python packages and set your API key as an environment variable for secure access.
pip install openai nltk rouge-score
# Set environment variable in your shell
export OPENAI_API_KEY="your_api_key_here"

Output

Requirement already satisfied: openai ...
Requirement already satisfied: nltk ...
Requirement already satisfied: rouge-score ...
Step by step
This example uses the OpenAI SDK to generate LLM responses and evaluates output quality with nltk BLEU and rouge-score metrics against reference texts.
import os
from openai import OpenAI
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Reference and prompt
reference = "The quick brown fox jumps over the lazy dog."
prompt = "Write a sentence about a quick fox and a lazy dog."
# Generate LLM output
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
output = response.choices[0].message.content
print("LLM output:", output)
# Automated evaluation
reference_tokens = [reference.split()]
candidate_tokens = output.split()
bleu_score = sentence_bleu(reference_tokens, candidate_tokens)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = scorer.score(reference, output)
print(f"BLEU score: {bleu_score:.4f}")
print(f"ROUGE-1 F1 score: {rouge_scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-L F1 score: {rouge_scores['rougeL'].fmeasure:.4f}")

Output

LLM output: The quick brown fox jumps over a lazy dog.
BLEU score: 0.7598
ROUGE-1 F1 score: 0.8889
ROUGE-L F1 score: 0.8571
Common variations
You can test output quality asynchronously or with other models like claude-3-5-haiku-20241022. Streaming output evaluation is also possible by accumulating tokens. Human evaluation remains essential for subjective quality aspects.
import asyncio
import os
from anthropic import AsyncAnthropic

async def async_test():
    # await requires the async client, not the synchronous Anthropic class
    client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=256,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": "Describe a quick fox and lazy dog."}]
    )
    # response.content is a list of content blocks; print the first block's text
    print("Claude output:", response.content[0].text)

asyncio.run(async_test())

Output
Claude output: A quick brown fox swiftly jumps over a lazy dog resting in the sun.
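Streaming evaluation boils down to accumulating text deltas into one string before scoring. A minimal offline sketch (the fake_stream list below is hypothetical stand-in data; with the OpenAI SDK you would iterate a stream=True response and read chunk.choices[0].delta.content):

```python
def accumulate_stream(chunks):
    """Join streamed text deltas into one string for evaluation."""
    parts = []
    for delta in chunks:
        if delta:  # streaming APIs may emit empty or None deltas
            parts.append(delta)
    return "".join(parts)

# Hypothetical stand-in for a real token stream:
fake_stream = ["The quick ", "brown fox ", None, "jumps over the lazy dog."]
full_output = accumulate_stream(fake_stream)
print(full_output)  # The quick brown fox jumps over the lazy dog.
```

Once the full string is assembled, it can be scored with the same BLEU/ROUGE code as the non-streaming case.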
Troubleshooting
- If BLEU or ROUGE scores are unexpectedly low, verify that tokenization matches between the reference and the output; for very short outputs, nltk's SmoothingFunction also helps avoid near-zero BLEU scores.
- Ensure your API key is correctly set in os.environ to avoid authentication errors.
- For truncated or inconsistent outputs, increase max_tokens or adjust prompt clarity.
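One way to keep tokenization consistent, as the first tip suggests, is to normalize both strings before splitting. The helper below is a hypothetical sketch: it lowercases, strips punctuation, and computes a unigram-overlap F1, the same idea ROUGE-1 formalizes:

```python
import string
from collections import Counter

def normalize_tokens(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def unigram_f1(reference, candidate):
    """Token-overlap F1 between two strings (the idea behind ROUGE-1)."""
    ref = Counter(normalize_tokens(reference))
    cand = Counter(normalize_tokens(candidate))
    overlap = sum((ref & cand).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("The quick brown fox.", "the QUICK brown dog"))  # 0.75
```

With both strings passed through the same normalization, case and punctuation differences no longer depress the score.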
Key Takeaways
- Combine automated metrics like BLEU and ROUGE with human review for comprehensive LLM output quality testing.
- Use official SDKs like OpenAI or Anthropic with environment-secured API keys for reliable generation.
- Tokenization consistency is critical for accurate metric evaluation.
- Test with multiple models and prompt variations to benchmark output quality effectively.