What is BERTScore for LLMs?
BERTScore is an evaluation metric that uses contextual embeddings from BERT to measure semantic similarity between generated and reference texts. It improves over traditional metrics by capturing meaning rather than exact token matches, making it well suited to assessing LLM outputs.
How it works
BERTScore evaluates text similarity by comparing token embeddings generated by a pretrained transformer model such as BERT. Instead of counting exact word overlaps like BLEU or ROUGE, it computes cosine similarity between contextual embeddings of tokens in the candidate and reference sentences. This captures semantic meaning even if the wording differs.
Think of it like comparing two paintings not by matching brush strokes exactly, but by comparing the overall style and colors to see how similar they feel.
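The matching step behind these scores can be sketched in plain NumPy. This is a toy illustration, not the library's implementation: each candidate token embedding is greedily matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. Real BERTScore draws the vectors from a pretrained transformer and optionally applies IDF weighting; the random embeddings here are placeholders.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    # Normalize rows so dot products equal cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # pairwise cosine-similarity matrix

    # Greedy matching: best reference match per candidate token, and vice versa
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-token candidate vs. 4-token reference with 5-dim placeholder embeddings
rng = np.random.default_rng(0)
p, r, f1 = bertscore_f1(rng.normal(size=(3, 5)), rng.normal(size=(4, 5)))
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

With identical candidate and reference embeddings, every token matches itself perfectly and all three scores come out as 1.0, which matches the intuition that a verbatim copy scores maximally.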
Concrete example
Here is a Python example using the bert_score library to evaluate similarity between an LLM-generated sentence and a reference:
from bert_score import score

# Candidate and reference sentences
candidate = ["The cat sat on the mat."]
reference = ["A cat was sitting on the mat."]

# Compute BERTScore using the default English model
P, R, F1 = score(candidate, reference, lang="en")

print(f"Precision: {P.mean().item():.4f}")
print(f"Recall: {R.mean().item():.4f}")
print(f"F1 Score: {F1.mean().item():.4f}")

Output:
Precision: 0.9602
Recall: 0.9578
F1 Score: 0.9590
When to use it
Use BERTScore when you need a more semantically aware evaluation of LLM-generated text, such as in summarization, translation, or dialogue generation. It is especially useful when exact word-overlap metrics fail due to paraphrasing or synonym use.
Do not use it when you require strict lexical matching, or when evaluating very short texts, where embedding similarity may be less reliable.
Key terms
| Term | Definition |
|---|---|
| BERTScore | An evaluation metric using contextual embeddings to measure semantic similarity between texts. |
| Contextual embeddings | Vector representations of words that capture meaning based on surrounding context. |
| Cosine similarity | A measure of similarity between two vectors based on the cosine of the angle between them. |
| LLM | Large Language Model, a transformer-based model trained on vast text data. |
| BLEU | A traditional n-gram based metric for evaluating machine translation quality. |
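To see why n-gram metrics like BLEU miss paraphrases, a tiny unigram-precision check makes the point. The sentence pair below is a made-up example chosen so the two texts share meaning but almost no surface words:

```python
def unigram_precision(candidate, reference):
    # Fraction of candidate tokens that also appear in the reference
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = sum(1 for tok in cand if tok in ref)
    return matches / len(cand)

cand = "The feline rested upon the rug."
ref = "The cat sat on the mat."
print(unigram_precision(cand, ref))  # low overlap despite similar meaning
```

Only "the" overlaps here, so a lexical metric scores the paraphrase poorly even though a human (or BERTScore's embedding comparison) would judge the two sentences near-equivalent.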
Key takeaways
- BERTScore uses transformer embeddings to evaluate semantic similarity, not just exact word matches.
- It provides more meaningful evaluation for LLM outputs that paraphrase or rephrase content.
- Use BERTScore for tasks like summarization and translation where meaning matters more than exact wording.