How to evaluate summarization quality
Quick answer
Evaluate summarization quality with automated metrics like ROUGE or BERTScore, which compare generated summaries against reference texts. You can also use LLM-based evaluation by prompting a model such as gpt-4o to score or critique summaries for coherence and relevance.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install rouge-score bert-score
Setup
Install required Python packages and set your OPENAI_API_KEY environment variable for API access.
```shell
pip install openai rouge-score bert-score
```

Output:

```
Collecting openai
Collecting rouge-score
Collecting bert-score
Successfully installed openai rouge-score bert-score
```
Step by step
This example shows how to evaluate a generated summary against a reference summary using ROUGE and BERTScore, plus an LLM critique with gpt-4o-mini.
```python
import os
from openai import OpenAI
from rouge_score import rouge_scorer
from bert_score import score

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example texts
reference = "The cat sat on the mat and looked outside the window."
generated = "A cat was sitting on a mat looking out the window."

# ROUGE evaluation
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = scorer.score(reference, generated)

# BERTScore evaluation
P, R, F1 = score([generated], [reference], lang='en', verbose=False)

print("ROUGE-1 F1:", rouge_scores['rouge1'].fmeasure)
print("ROUGE-L F1:", rouge_scores['rougeL'].fmeasure)
print(f"BERTScore F1: {F1[0].item():.4f}")

# LLM-based evaluation prompt
prompt = (
    f"Evaluate the quality of this summary:\n"
    f"Summary: {generated}\n"
    f"Reference: {reference}\n"
    "Provide a score from 1 to 10 and a brief explanation."
)
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print("LLM evaluation:", response.choices[0].message.content)
```

Output:

```
ROUGE-1 F1: 0.8571428571428571
ROUGE-L F1: 0.75
BERTScore F1: 0.9273
LLM evaluation: Score: 9/10
The summary captures the main points accurately and is coherent, with minor wording differences.
```
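If you want to aggregate LLM critiques across many summaries, you can pull the numeric score out of the free-text reply. The helper below is a minimal sketch: the `parse_llm_score` name and the assumption that the model answers in a "Score: N/10" (or "N out of 10") format are ours, not part of the OpenAI API.

```python
import re
from typing import Optional

def parse_llm_score(text: str) -> Optional[float]:
    """Extract a numeric score like 'Score: 9/10' or '7 out of 10' from LLM output."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*10", text)
    return float(match.group(1)) if match else None

print(parse_llm_score("Score: 9/10\nThe summary is coherent."))  # 9.0
print(parse_llm_score("I'd rate this 7 out of 10."))             # 7.0
print(parse_llm_score("No score given."))                        # None
```

Because the model may ignore the requested format, always handle the `None` case (for example, by retrying with a stricter prompt).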
Common variations
- Use `async` calls with the OpenAI SDK for non-blocking evaluation.
- Try different models like `gpt-4o-mini` for faster, cheaper LLM scoring.
- Use other metrics such as METEOR or BLEU depending on your summarization domain.
```python
import asyncio
import os
from openai import AsyncOpenAI

async def async_llm_evaluation():
    # Use the async client so the request can be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Evaluate this summary quality from 1 to 10."
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    print("Async LLM evaluation:", response.choices[0].message.content)

asyncio.run(async_llm_evaluation())
```

Output:

```
Async LLM evaluation: Score: 8/10
The summary is mostly accurate but could be more concise.
```
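For the BLEU variation mentioned above, libraries such as nltk or sacrebleu provide full implementations. As a rough illustration of what unigram BLEU measures, here is a minimal self-contained sketch; the `unigram_precision` helper is our own, not from any library, and omits BLEU's brevity penalty and higher-order n-grams.

```python
from collections import Counter

def unigram_precision(generated: str, reference: str) -> float:
    """Clipped unigram precision: the core ingredient of BLEU-1."""
    gen_tokens = generated.lower().split()
    ref_counts = Counter(reference.lower().split())
    # Each generated token is credited only up to its frequency in the reference
    matched = sum(min(count, ref_counts[token])
                  for token, count in Counter(gen_tokens).items())
    return matched / len(gen_tokens) if gen_tokens else 0.0

reference = "the cat sat on the mat"
generated = "the cat sat on a mat"
print(f"{unigram_precision(generated, reference):.3f}")  # 0.833
```

Five of the six generated tokens appear in the reference, hence 5/6. In practice, prefer a maintained implementation, which also handles tokenization and smoothing.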
Troubleshooting
- If you get authentication errors, verify your `OPENAI_API_KEY` is set correctly in your environment.
- For metric installation issues, ensure you have Python 3.8+ and run `pip install --upgrade pip` before installing dependencies.
- If LLM responses are incomplete, increase `max_tokens` in the `chat.completions.create` call.
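You can also detect truncation programmatically: the API sets a choice's `finish_reason` to `"length"` when generation stopped because it hit the token limit. A minimal sketch, where the `is_truncated` helper is our own illustration (the stand-in object below mimics the response shape so the example runs without an API call):

```python
from types import SimpleNamespace

def is_truncated(response) -> bool:
    """Return True if the completion stopped because it hit max_tokens."""
    return response.choices[0].finish_reason == "length"

# Stand-in objects shaped like chat completion responses
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(is_truncated(cut_off))   # True
print(is_truncated(complete))  # False
```

When `is_truncated` returns True, retry the request with a larger `max_tokens` value.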
Key Takeaways
- Use automated metrics like ROUGE and BERTScore for objective summarization evaluation.
- Leverage gpt-4o or similar LLMs to get qualitative scoring and explanations.
- Set up environment variables and dependencies carefully to avoid runtime errors.