How to evaluate summarization quality
Quick answer
Evaluate summarization quality with automated metrics like ROUGE or BERTScore, which compare generated summaries against reference texts. You can also use LLM-based evaluation by prompting a model such as gpt-4o to score or critique summaries for coherence and relevance.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install rouge-score bert-score
Setup
Install required Python packages and set your OPENAI_API_KEY environment variable for API access.
```shell
pip install openai rouge-score bert-score
```

Output:

```
Collecting openai
Collecting rouge-score
Collecting bert-score
Successfully installed openai rouge-score bert-score
```
Step by step
This example shows how to evaluate a generated summary against a reference summary using ROUGE and BERTScore, plus an LLM critique with gpt-4o-mini.
```python
import os
from openai import OpenAI
from rouge_score import rouge_scorer
from bert_score import score

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example texts
reference = "The cat sat on the mat and looked outside the window."
generated = "A cat was sitting on a mat looking out the window."

# ROUGE evaluation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = scorer.score(reference, generated)

# BERTScore evaluation
P, R, F1 = score([generated], [reference], lang='en', verbose=False)

print("ROUGE-1 F1:", rouge_scores['rouge1'].fmeasure)
print("ROUGE-L F1:", rouge_scores['rougeL'].fmeasure)
print(f"BERTScore F1: {F1[0].item():.4f}")

# LLM-based evaluation prompt
prompt = (
    f"Evaluate the quality of this summary:\n"
    f"Summary: {generated}\n"
    f"Reference: {reference}\n"
    "Provide a score from 1 to 10 and a brief explanation."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print("LLM evaluation:", response.choices[0].message.content)
```

Output:

```
ROUGE-1 F1: 0.8571428571428571
ROUGE-L F1: 0.75
BERTScore F1: 0.9273
LLM evaluation: Score: 9/10
The summary captures the main points accurately and is coherent, with minor wording differences.
```
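If you want to aggregate LLM critiques across many summaries, you can pull the numeric score out of the free-text reply. The helper below is a minimal sketch: the `parse_llm_score` name and the assumption that the model answers in a "Score: N/10" (or "N out of 10") format are ours, not part of the OpenAI API.

```python
import re
from typing import Optional

def parse_llm_score(text: str) -> Optional[float]:
    """Extract a numeric score like 'Score: 9/10' or '7 out of 10' from LLM output."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*10", text)
    return float(match.group(1)) if match else None

print(parse_llm_score("Score: 9/10\nThe summary is coherent."))  # 9.0
print(parse_llm_score("I'd rate this 7 out of 10."))             # 7.0
print(parse_llm_score("No score given."))                        # None
```

Because the model may ignore the requested format, always handle the `None` case (for example, by retrying with a stricter prompt).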
Common variations
- Use `async` calls with the OpenAI SDK for non-blocking evaluation.
- Try different models like `gpt-4o-mini` for faster, cheaper LLM scoring.
- Use other metrics such as METEOR or BLEU depending on your summarization domain.
```python
import asyncio
import os
from openai import AsyncOpenAI

async def async_llm_evaluation():
    # Use the async client so the request can be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Evaluate this summary quality from 1 to 10."
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async LLM evaluation:", response.choices[0].message.content)

asyncio.run(async_llm_evaluation())
```

Output:

```
Async LLM evaluation: Score: 8/10
The summary is mostly accurate but could be more concise.
```
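For the BLEU variation mentioned above, libraries such as nltk or sacrebleu provide full implementations. As a rough illustration of what unigram BLEU measures, here is a minimal self-contained sketch; the `unigram_precision` helper is our own, not from any library, and omits BLEU's brevity penalty and higher-order n-grams.

```python
from collections import Counter

def unigram_precision(generated: str, reference: str) -> float:
    """Clipped unigram precision: the core ingredient of BLEU-1."""
    gen_tokens = generated.lower().split()
    ref_counts = Counter(reference.lower().split())
    # Each generated token is credited only up to its frequency in the reference
    matched = sum(min(count, ref_counts[token])
                  for token, count in Counter(gen_tokens).items())
    return matched / len(gen_tokens) if gen_tokens else 0.0

reference = "the cat sat on the mat"
generated = "the cat sat on a mat"
print(f"{unigram_precision(generated, reference):.3f}")  # 0.833
```

Five of the six generated tokens appear in the reference, hence 5/6. In practice, prefer a maintained implementation, which also handles tokenization and smoothing.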
Troubleshooting
- If you get authentication errors, verify your `OPENAI_API_KEY` is set correctly in your environment.
- For metric installation issues, ensure you have Python 3.8+ and run `pip install --upgrade pip` before installing dependencies.
- If LLM responses are incomplete, increase `max_tokens` in the `chat.completions.create` call.
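You can also detect truncation programmatically: the API sets a choice's `finish_reason` to `"length"` when generation stopped because it hit the token limit. A minimal sketch, where the `is_truncated` helper is our own illustration (the stand-in object below mimics the response shape so the example runs without an API call):

```python
from types import SimpleNamespace

def is_truncated(response) -> bool:
    """Return True if the completion stopped because it hit max_tokens."""
    return response.choices[0].finish_reason == "length"

# Stand-in objects shaped like chat completion responses
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(is_truncated(cut_off))   # True
print(is_truncated(complete))  # False
```

When `is_truncated` returns True, retry the request with a larger `max_tokens` value.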
Key Takeaways
- Use automated metrics like ROUGE and BERTScore for objective summarization evaluation.
- Leverage gpt-4o or similar LLMs to get qualitative scoring and explanations.
- Set up environment variables and dependencies carefully to avoid runtime errors.