How-to · Intermediate · 3 min read

How to measure LLM answer correctness

Quick answer
Measure LLM answer correctness by comparing generated outputs against reference answers with metrics such as exact match, token-level F1, or embedding-based semantic similarity. Use the OpenAI or Anthropic SDK to generate answers and a library like sentence-transformers for embedding-based evaluation.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" sentence-transformers

Setup

Install the required Python packages and set your API key as an environment variable.

  • Install OpenAI SDK and sentence-transformers for embeddings:
bash
pip install openai sentence-transformers
output
Collecting openai
Collecting sentence-transformers
Successfully installed openai sentence-transformers

Step by step

This example uses the OpenAI SDK to generate an answer and sentence-transformers to compare it against a reference answer. It demonstrates both exact-match and embedding-similarity metrics.

python
import os
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

# Initialize clients
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

# Reference answer and prompt
reference_answer = "Paris is the capital of France."
prompt = "What is the capital of France?"

# Generate LLM answer
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
generated_answer = response.choices[0].message.content.strip()

print(f"Generated answer: {generated_answer}")

# Exact match metric
exact_match = generated_answer.lower() == reference_answer.lower()
print(f"Exact match: {exact_match}")

# Semantic similarity metric
ref_embedding = embed_model.encode(reference_answer, convert_to_tensor=True)
gen_embedding = embed_model.encode(generated_answer, convert_to_tensor=True)
similarity = util.cos_sim(ref_embedding, gen_embedding).item()
print(f"Semantic similarity (cosine): {similarity:.4f}")
output
Generated answer: Paris is the capital of France.
Exact match: True
Semantic similarity (cosine): 1.0000
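The quick answer also mentions token-level F1, which sits between brittle exact match and embedding similarity. A minimal, dependency-free sketch (the `token_f1` helper and its normalization are illustrative, not from any library):

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and strip punctuation so "Paris." and "paris" compare equal
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def token_f1(generated, reference):
    """Harmonic mean of token precision and recall against the reference."""
    gen, ref = tokenize(generated), tokenize(reference)
    if not gen or not ref:
        return 0.0
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris.", "Paris is the capital of France."))  # 1.0
```

Unlike raw exact match, F1 gives partial credit: an answer that contains only some of the reference tokens scores between 0 and 1 instead of failing outright.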

Common variations

You can measure correctness asynchronously or with different models such as claude-3-5-haiku-20241022 (use Anthropic's AsyncAnthropic client for async calls). For streaming, accumulate tokens and evaluate after the stream completes. The embedding model can also be swapped for a more powerful one depending on accuracy needs.

python
import asyncio
import os
from anthropic import AsyncAnthropic
from sentence_transformers import SentenceTransformer, util

async def async_measure():
    client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    embed_model = SentenceTransformer('all-MiniLM-L6-v2')

    prompt = "What is the capital of France?"
    reference_answer = "Paris is the capital of France."

    response = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}]
    )
    generated_answer = response.content[0].text.strip()

    print(f"Generated answer: {generated_answer}")

    exact_match = generated_answer.lower() == reference_answer.lower()
    print(f"Exact match: {exact_match}")

    ref_embedding = embed_model.encode(reference_answer, convert_to_tensor=True)
    gen_embedding = embed_model.encode(generated_answer, convert_to_tensor=True)
    similarity = util.cos_sim(ref_embedding, gen_embedding).item()
    print(f"Semantic similarity (cosine): {similarity:.4f}")

asyncio.run(async_measure())
output
Generated answer: Paris is the capital of France.
Exact match: True
Semantic similarity (cosine): 1.0000
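For the streaming variation, collect each text delta as it arrives and score the joined string once the stream ends. A sketch against the OpenAI streaming API (the `accumulate_stream` helper is illustrative; the API call runs only when a key is present):

```python
import os

def accumulate_stream(deltas):
    """Join streamed text deltas (skipping empty chunks) into one answer."""
    return "".join(d for d in deltas if d).strip()

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        stream=True,
    )
    # delta.content is None on role/finish chunks; the helper filters those out
    answer = accumulate_stream(chunk.choices[0].delta.content for chunk in stream)
    print(answer)  # evaluate exact match / similarity on this full string
```

Evaluating only after accumulation matters: per-chunk scoring would compare fragments against the full reference and systematically underrate the answer.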

Troubleshooting

  • If exact match is always false due to minor wording differences, use semantic similarity or token-based F1 score instead.
  • If embedding similarity is low, verify you use the same embedding model for both reference and generated answers.
  • Ensure API keys are set correctly in environment variables to avoid authentication errors.
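For the first issue, a light normalization pass before comparing often recovers matches lost to casing, punctuation, or stray whitespace. A minimal sketch (the `normalized_match` helper is illustrative):

```python
import re

def normalized_match(generated, reference):
    """Exact match after lowercasing, stripping punctuation, collapsing whitespace."""
    norm = lambda s: " ".join(re.sub(r"[^\w\s]", "", s.lower()).split())
    return norm(generated) == norm(reference)

print(normalized_match("Paris is the capital of France.", "paris is the capital of France"))  # True
```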

Key Takeaways

  • Use exact match and semantic similarity metrics to measure LLM answer correctness effectively.
  • Leverage embedding models like sentence-transformers for robust semantic evaluation beyond exact text matching.
  • Always compare generated answers against trusted reference answers for reliable correctness measurement.
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022, all-MiniLM-L6-v2