How to measure LLM hallucination rate
Quick answer
Measure the LLM hallucination rate by comparing model outputs against verified ground truth, using benchmark datasets or human annotation. Calculate the percentage of outputs containing false or fabricated information to quantify hallucinations.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
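On macOS or Linux, for example, the key can be exported in the shell before running the script (the key value below is a placeholder, not a real key):

```shell
export OPENAI_API_KEY="sk-your-key-here"
```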
```shell
pip install "openai>=1.0"
```
(Quoting the requirement prevents the shell from treating `>=1.0` as a redirection.)
Step by step
This example demonstrates how to measure hallucination rate by prompting an LLM with factual questions, then comparing the answers to a ground truth dataset. We calculate the hallucination rate as the percentage of incorrect or fabricated answers.
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample factual questions and ground truth answers
# (time-sensitive facts like "current president" must be kept up to date)
questions = [
    "Who is the current president of the United States?",
    "What is the capital of France?",
    "Name the largest planet in our solar system.",
]
ground_truth = [
    "Joe Biden",
    "Paris",
    "Jupiter",
]

hallucinations = 0
for question, truth in zip(questions, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content.strip()
    print(f"Q: {question}")
    print(f"Model answer: {answer}")
    print(f"Ground truth: {truth}")
    # Simple string match check (can be replaced with more robust evaluation)
    if truth.lower() not in answer.lower():
        hallucinations += 1
        print("-> Hallucination detected\n")
    else:
        print("-> Correct answer\n")

hallucination_rate = hallucinations / len(questions) * 100
print(f"Hallucination rate: {hallucination_rate:.2f}%")
```

Output

```
Q: Who is the current president of the United States?
Model answer: Joe Biden
Ground truth: Joe Biden
-> Correct answer

Q: What is the capital of France?
Model answer: Paris
Ground truth: Paris
-> Correct answer

Q: Name the largest planet in our solar system.
Model answer: Saturn
Ground truth: Jupiter
-> Hallucination detected

Hallucination rate: 33.33%
```
Common variations
You can measure hallucination rate using:
- Human annotation: Have experts label outputs as hallucinated or factual.
- Automated metrics: Use factuality benchmarks like FEVER or TruthfulQA datasets.
- Different models: Test with claude-3-5-sonnet-20241022 or gemini-2.5-pro for comparison.
- Async calls: Use async SDK methods for batch evaluation at scale.
```python
import asyncio
import os

from openai import AsyncOpenAI

# The 1.x SDK's async client is AsyncOpenAI; its create() method is
# awaited directly (there is no acreate in this SDK version).
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])


async def evaluate_question(question, truth):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    answer = response.choices[0].message.content.strip()
    is_hallucinated = truth.lower() not in answer.lower()
    return is_hallucinated, answer


async def main():
    questions = [
        "Who is the current president of the United States?",
        "What is the capital of France?",
        "Name the largest planet in our solar system.",
    ]
    ground_truth = ["Joe Biden", "Paris", "Jupiter"]
    # Fire all requests concurrently and collect results in order
    results = await asyncio.gather(
        *[evaluate_question(q, t) for q, t in zip(questions, ground_truth)]
    )
    hallucinations = sum(1 for is_hallucinated, _ in results if is_hallucinated)
    hallucination_rate = hallucinations / len(questions) * 100
    print(f"Hallucination rate: {hallucination_rate:.2f}%")


asyncio.run(main())
```

Output

```
Hallucination rate: 33.33%
```
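The human-annotation variation above can be partially automated with an LLM-as-judge check: a second model call grades each answer against the reference instead of a string match. A minimal sketch, assuming the same gpt-4o-mini model; `JUDGE_PROMPT` and `parse_verdict` are illustrative names invented here, not part of the SDK:

```python
import os

# Illustrative judge prompt: ask for a bare YES/NO verdict so parsing stays trivial
JUDGE_PROMPT = (
    "Question: {q}\n"
    "Reference answer: {t}\n"
    "Model answer: {a}\n"
    "Does the model answer agree with the reference? Reply with only YES or NO."
)


def parse_verdict(text: str) -> bool:
    """Return True when the judge flags a hallucination (verdict starts with NO).

    Note this simple check would also match replies like "NOT SURE";
    a production judge should constrain or validate the output format.
    """
    return text.strip().upper().startswith("NO")


def judge(question: str, truth: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    # Imported lazily so the pure helpers above work without the SDK installed
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(q=question, t=truth, a=answer)}
        ],
    )
    return parse_verdict(response.choices[0].message.content)
```

A judge model tolerates paraphrases ("the 46th president, Joe Biden") that a substring check would miss, at the cost of an extra API call per answer and the judge's own error rate.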
Troubleshooting
- If hallucination rate seems unexpectedly high, verify your ground truth data accuracy and consider more robust answer matching (e.g., semantic similarity).
- For ambiguous or time-sensitive questions (such as "current president"), hallucination detection may require human review and periodic ground-truth updates.
- Ensure API keys are set correctly to avoid authentication errors.
- Use consistent model versions to maintain evaluation reliability.
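As one way to make matching more robust without extra dependencies, the substring check can be replaced with a fuzzy window comparison from the standard library; a stdlib-only sketch, where `fuzzy_contains` and the 0.8 threshold are illustrative choices rather than established defaults:

```python
from difflib import SequenceMatcher


def fuzzy_contains(truth: str, answer: str, threshold: float = 0.8) -> bool:
    """Return True if the answer contains the truth string, exactly or approximately."""
    truth = truth.lower().strip()
    answer = answer.lower()
    # Fast path: exact substring match, as in the main example
    if truth in answer:
        return True
    # Slide a window the same number of words as the truth across the answer
    # and compare each window by character-level similarity ratio
    words = answer.split()
    n = max(1, len(truth.split()))
    for i in range(len(words) - n + 1):
        window = " ".join(words[i : i + n])
        if SequenceMatcher(None, truth, window).ratio() >= threshold:
            return True
    return False
```

This catches near-misses like "Joseph Biden" for "Joe Biden" that the plain substring test would score as hallucinations, while still rejecting genuinely wrong answers.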
Key takeaways
- Use benchmark datasets or human annotation to identify hallucinated outputs accurately.
- Calculate hallucination rate as the percentage of incorrect or fabricated model responses.
- Automate evaluation with SDK async calls for scalable and repeatable measurement.
- Robust matching methods improve hallucination detection beyond simple string comparison.
- Consistent model versions and clean ground truth data are critical for reliable metrics.