What is BERTScore for LLMs?
BERTScore is an evaluation metric that uses contextual embeddings from BERT to measure semantic similarity between generated and reference texts. It improves over traditional metrics by capturing meaning rather than exact token matches, making it well suited to assessing LLM outputs.
How it works
BERTScore evaluates text similarity by comparing token embeddings generated by a pretrained transformer model such as BERT. Instead of counting exact word overlaps like BLEU or ROUGE, it computes cosine similarity between contextual embeddings of tokens in the candidate and reference sentences. This captures semantic meaning even if the wording differs.
Think of it like comparing two paintings not by matching brush strokes exactly, but by comparing the overall style and colors to see how similar they feel.
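The matching step behind these scores can be sketched in plain NumPy. This is a toy illustration, not the library's implementation: each candidate token embedding is greedily matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. Real BERTScore draws the vectors from a pretrained transformer and optionally applies IDF weighting; the random embeddings here are placeholders.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    # Normalize rows so dot products equal cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # pairwise cosine-similarity matrix

    # Greedy matching: best reference match per candidate token, and vice versa
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-token candidate vs. 4-token reference with 5-dim placeholder embeddings
rng = np.random.default_rng(0)
p, r, f1 = bertscore_f1(rng.normal(size=(3, 5)), rng.normal(size=(4, 5)))
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

With identical candidate and reference embeddings, every token matches itself perfectly and all three scores come out as 1.0, which matches the intuition that a verbatim copy scores maximally.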
Concrete example
Here is a Python example using the bert_score library to evaluate similarity between an LLM-generated sentence and a reference:
from bert_score import score

# Candidate and reference sentences
candidate = ["The cat sat on the mat."]
reference = ["A cat was sitting on the mat."]

# Compute BERTScore using the default English model
P, R, F1 = score(candidate, reference, lang="en")

print(f"Precision: {P.mean().item():.4f}")
print(f"Recall: {R.mean().item():.4f}")
print(f"F1 Score: {F1.mean().item():.4f}")

Output:
Precision: 0.9602
Recall: 0.9578
F1 Score: 0.9590
When to use it
Use BERTScore when you need a more semantically aware evaluation of LLM-generated text, such as in summarization, translation, or dialogue generation. It is especially useful when exact word-overlap metrics fail due to paraphrasing or synonym use.
Do not use it when you require strict lexical matching, or when evaluating very short texts, where embedding similarity may be less reliable.
Key terms
| Term | Definition |
|---|---|
| BERTScore | An evaluation metric using contextual embeddings to measure semantic similarity between texts. |
| Contextual embeddings | Vector representations of words that capture meaning based on surrounding context. |
| Cosine similarity | A measure of similarity between two vectors based on the cosine of the angle between them. |
| LLM | Large Language Model, a transformer-based model trained on vast text data. |
| BLEU | A traditional n-gram based metric for evaluating machine translation quality. |
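To see why n-gram metrics like BLEU miss paraphrases, a tiny unigram-precision check makes the point. The sentence pair below is a made-up example chosen so the two texts share meaning but almost no surface words:

```python
def unigram_precision(candidate, reference):
    # Fraction of candidate tokens that also appear in the reference
    cand, ref = candidate.lower().split(), reference.lower().split()
    matches = sum(1 for tok in cand if tok in ref)
    return matches / len(cand)

cand = "The feline rested upon the rug."
ref = "The cat sat on the mat."
print(unigram_precision(cand, ref))  # low overlap despite similar meaning
```

Only "the" overlaps here, so a lexical metric scores the paraphrase poorly even though a human (or BERTScore's embedding comparison) would judge the two sentences near-equivalent.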
Key takeaways
- BERTScore uses transformer embeddings to evaluate semantic similarity, not just exact word matches.
- It provides more meaningful evaluation for LLM outputs that paraphrase or rephrase content.
- Use BERTScore for tasks like summarization and translation where meaning matters more than exact wording.