What is LLM evaluation
How it works
LLM evaluation compares a model's generated outputs against reference answers or expected behaviors using metrics such as perplexity, BLEU, or human ratings. Think of it like grading a student's essay: you check for correctness, clarity, and relevance. Automated metrics provide quick, repeatable scores, while human evaluation captures nuance and safety concerns. Combining the two helps ensure the model performs well across tasks and avoids harmful outputs.
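Of the automated metrics mentioned above, BLEU is the most mechanical. A minimal sketch of its simplest ingredient, clipped unigram precision, is shown below; this is a simplified stand-in for illustration, not the full BLEU score (which combines n-gram precisions with a brevity penalty):

```python
import re
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with counts clipped as in BLEU's modified precision."""
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    cand_counts = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)

reference = "The capital of France is Paris."
candidate = "Paris is the capital city of France."
print(f"Unigram precision: {unigram_precision(reference, candidate):.2f}")  # 6 of 7 tokens match
```

Note that word-overlap metrics like this reward surface similarity only; the embedding-based approach in the next section captures paraphrases that share few exact words.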
Concrete example
Here is a simple Python example using the OpenAI SDK to evaluate a model's response quality by comparing it to a reference answer using a similarity score (cosine similarity) with embeddings:
```python
import os

from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Reference answer and model output
reference = "The capital of France is Paris."
model_output = "Paris is the capital city of France."

# Get embeddings for both texts
ref_embedding = client.embeddings.create(
    input=reference, model="text-embedding-3-large"
).data[0].embedding
output_embedding = client.embeddings.create(
    input=model_output, model="text-embedding-3-large"
).data[0].embedding

# Calculate cosine similarity between the two embedding vectors
similarity = cosine_similarity([ref_embedding], [output_embedding])[0][0]
print(f"Similarity score: {similarity:.4f}")  # e.g. Similarity score: 0.9876
```
When to use it
Use LLM evaluation when you need to verify that a language model meets quality standards for tasks like summarization, translation, or question answering. It is essential before deploying a model to production to confirm accuracy, fairness, and safety. Avoid relying solely on automated metrics; combine them with human review for sensitive or high-stakes applications.
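In practice, automated scores like the cosine similarity above are often wired into a regression suite that gates deployment. A minimal sketch follows; the 0.85 threshold, case names, and scores are arbitrary choices for illustration:

```python
def passes_eval(similarity: float, threshold: float = 0.85) -> bool:
    """Gate a model output: accept only if it is close enough to the reference."""
    return similarity >= threshold

# Hypothetical scores from an eval run over a small test set
eval_results = [
    {"case": "capital-of-france", "similarity": 0.9876},
    {"case": "tricky-paraphrase", "similarity": 0.62},
]
failures = [r["case"] for r in eval_results if not passes_eval(r["similarity"])]
print(f"{len(failures)} failing case(s): {failures}")
```

A threshold like this should be calibrated on held-out examples rather than picked by intuition, and failing cases routed to human review rather than silently dropped.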
Key terms
| Term | Definition |
|---|---|
| Perplexity | A metric measuring how well a model predicts a sample; lower is better. |
| BLEU | A score evaluating the overlap between generated and reference text, used in translation. |
| Human evaluation | Qualitative assessment by people to judge output quality and safety. |
| Cosine similarity | A measure of similarity between two vectors, often used with embeddings. |
| Benchmark | A standardized test dataset or task used to compare model performance. |
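Perplexity, as defined in the table, can be computed directly from per-token probabilities: it is the exponential of the average negative log-probability. A minimal sketch with made-up probabilities (real values would come from the model's log-probs):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Exponential of the average negative log-probability per token; lower is better."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities a model assigned to each token of a sentence
probs = [0.25, 0.5, 0.9, 0.8]
print(f"Perplexity: {perplexity(probs):.2f}")
```

Intuitively, a perplexity of k means the model was, on average, as uncertain as if it were choosing uniformly among k tokens; a model assigning every token probability 0.5 has perplexity exactly 2.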
Key takeaways
- LLM evaluation combines automated metrics and human judgment to measure model quality.
- Use embeddings and similarity scores for quantitative comparison of outputs.
- Human review is critical for assessing safety and nuanced language understanding.