How-to · Intermediate · 3 min read

How to measure LLM hallucination rate

Quick answer
Measure an LLM's hallucination rate by comparing model outputs against verified ground truth, using benchmark datasets or human annotation. The hallucination rate is the percentage of outputs that contain false or fabricated information.
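
The arithmetic itself is simple; as a minimal sketch (the function name is illustrative), the rate is just the share of outputs labeled hallucinated:

python
def hallucination_rate(labels):
    """labels: list of booleans, True if that output was hallucinated."""
    return sum(labels) / len(labels) * 100

# e.g., 1 hallucinated answer out of 3
print(f"{hallucination_rate([False, False, True]):.2f}%")  # 33.33%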

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
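
Then export your API key so the SDK can read it from the environment (the value shown is a placeholder, not a real key):

bash
export OPENAI_API_KEY="your-api-key-here"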

Step by step

This example demonstrates how to measure hallucination rate by prompting an LLM with factual questions, then comparing the answers to a ground truth dataset. We calculate the hallucination rate as the percentage of incorrect or fabricated answers.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample factual questions and ground truth answers
questions = [
    "Who is the current president of the United States?",
    "What is the capital of France?",
    "Name the largest planet in our solar system."
]
ground_truth = [
    "Joe Biden",
    "Paris",
    "Jupiter"
]

hallucinations = 0

for question, truth in zip(questions, ground_truth):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    answer = response.choices[0].message.content.strip()
    print(f"Q: {question}")
    print(f"Model answer: {answer}")
    print(f"Ground truth: {truth}")

    # Simple string match check (can be replaced with more robust evaluation)
    if truth.lower() not in answer.lower():
        hallucinations += 1
        print("-> Hallucination detected\n")
    else:
        print("-> Correct answer\n")

hallucination_rate = hallucinations / len(questions) * 100
print(f"Hallucination rate: {hallucination_rate:.2f}%")
output
Q: Who is the current president of the United States?
Model answer: Joe Biden
Ground truth: Joe Biden
-> Correct answer

Q: What is the capital of France?
Model answer: Paris
Ground truth: Paris
-> Correct answer

Q: Name the largest planet in our solar system.
Model answer: Saturn
Ground truth: Jupiter
-> Hallucination detected

Hallucination rate: 33.33%

Common variations

You can measure the hallucination rate using:

  • Human annotation: Have experts label outputs as hallucinated or factual.
  • Automated metrics: Use factuality benchmarks like FEVER or TruthfulQA datasets.
  • Different models: Test with claude-3-5-sonnet-20241022 or gemini-2.5-pro for comparison.
  • Async calls: Use async SDK methods for batch evaluation at scale.
python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def evaluate_question(question, truth):
    # Use the async client's create(); there is no acreate() in openai>=1.0
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}]
    )
    answer = response.choices[0].message.content.strip()
    is_hallucinated = truth.lower() not in answer.lower()
    return is_hallucinated, answer

async def main():
    questions = [
        "Who is the current president of the United States?",
        "What is the capital of France?",
        "Name the largest planet in our solar system."
    ]
    ground_truth = ["Joe Biden", "Paris", "Jupiter"]

    results = await asyncio.gather(*[evaluate_question(q, t) for q, t in zip(questions, ground_truth)])

    hallucinations = sum(1 for h, _ in results if h)
    hallucination_rate = hallucinations / len(questions) * 100
    print(f"Hallucination rate: {hallucination_rate:.2f}%")

asyncio.run(main())
output
Hallucination rate: 33.33%

Troubleshooting

  • If hallucination rate seems unexpectedly high, verify your ground truth data accuracy and consider more robust answer matching (e.g., semantic similarity).
  • For ambiguous questions, hallucination detection may require human review.
  • Ensure API keys are set correctly to avoid authentication errors.
  • Use consistent model versions to maintain evaluation reliability.
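
As one lightweight upgrade over exact substring matching, you can add fuzzy comparison with Python's standard-library difflib. This is only a rough stand-in for true semantic similarity (an embedding model or NLI-based checker would be more robust), and the 0.8 threshold is an assumption you should tune on your data:

python
from difflib import SequenceMatcher

def matches(answer: str, truth: str, threshold: float = 0.8) -> bool:
    """True if the answer contains the truth, or is fuzzily similar to it."""
    a, t = answer.lower().strip(), truth.lower().strip()
    if t in a:
        return True  # substring match, as in the main example
    return SequenceMatcher(None, a, t).ratio() >= threshold

print(matches("The capital of France is Paris.", "Paris"))  # True (substring)
print(matches("Pariss", "Paris"))                           # True (fuzzy)
print(matches("Saturn", "Jupiter"))                         # False

This tolerates minor spelling variation without flagging it as a hallucination, while still catching genuinely wrong answers.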

Key Takeaways

  • Use benchmark datasets or human annotation to identify hallucinated outputs accurately.
  • Calculate hallucination rate as the percentage of incorrect or fabricated model responses.
  • Automate evaluation with SDK async calls for scalable and repeatable measurement.
  • Robust matching methods improve hallucination detection beyond simple string comparison.
  • Consistent model versions and clean ground truth data are critical for reliable metrics.
Verified 2026-04 · gpt-4o-mini, claude-3-5-sonnet-20241022, gemini-2.5-pro