How to measure LLM faithfulness
Quick answer
Measure LLM faithfulness by comparing the model's generated outputs against verified ground truth using metrics like ROUGE, BLEU, and factual consistency checks. Use Python libraries and AI APIs to automate evaluation with reference texts and prompt engineering.
Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install rouge-score
- pip install nltk
Setup
Install required Python packages and set your OpenAI API key as an environment variable.
- Install packages: openai for API calls, rouge-score and nltk for evaluation metrics.
- Set the environment variable OPENAI_API_KEY to your API key.

pip install openai rouge-score nltk

output
Collecting openai
Collecting rouge-score
Collecting nltk
Successfully installed openai rouge-score nltk
Step by step
This example shows how to generate a model completion with gpt-4o and measure faithfulness by comparing the output to a reference answer using rouge-score.
import os
from openai import OpenAI
from rouge_score import rouge_scorer
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Reference factual text
reference = "The Eiffel Tower is located in Paris, France."
# Prompt to generate a factual statement
prompt = "Where is the Eiffel Tower located?"
# Generate completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
output = response.choices[0].message.content.strip()
print("Model output:", output)
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
# Score model output against reference
scores = scorer.score(reference, output)
print(f"ROUGE-L F1 score: {scores['rougeL'].fmeasure:.3f}")

output
Model output: The Eiffel Tower is in Paris, France.
ROUGE-L F1 score: 0.923
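ROUGE-L is driven by the longest common subsequence (LCS) between the reference and the candidate. As a minimal, dependency-free sketch of that computation (no stemming or tokenizer normalization, so its values differ slightly from rouge-score's), the F1 can be derived directly:

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from the LCS of whitespace tokens (no stemming)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for longest common subsequence length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # LCS share of the candidate
    recall = lcs / len(ref)       # LCS share of the reference
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("The Eiffel Tower is located in Paris, France.",
                       "The Eiffel Tower is in Paris, France."), 3))  # 0.933
```

The small gap to rouge-score's 0.923 comes from its extra tokenization and stemming steps.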
Common variations
You can measure faithfulness with other metrics like BLEU using nltk, or perform factual consistency checks by prompting the model to verify its own output.
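A factual consistency check can be sketched as an LLM-as-judge prompt. The helper names below (build_check_prompt, parse_verdict) are hypothetical, and only the prompt construction and verdict parsing are shown; the actual chat-completion call that would answer the prompt is omitted:

```python
def build_check_prompt(reference: str, output: str) -> str:
    """Ask a judge model whether the output is supported by the reference."""
    return (
        "Reference: " + reference + "\n"
        "Claim: " + output + "\n"
        "Does the reference fully support the claim? Answer YES or NO."
    )

def parse_verdict(answer: str) -> bool:
    """Treat a leading YES (case-insensitive) as a pass."""
    return answer.strip().upper().startswith("YES")

prompt = build_check_prompt(
    "The Eiffel Tower is located in Paris, France.",
    "The Eiffel Tower is in Paris, France.",
)
# A judge model's reply would be parsed like this:
print(parse_verdict("YES, the claim is supported."))  # True
```

Constraining the judge to YES/NO keeps the parsing trivial; for graded faithfulness you could instead ask for a 1-5 score and parse the integer.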
For async usage, use asyncio with the OpenAI SDK's AsyncOpenAI client. You can also test other models like claude-3-5-sonnet-20241022 or gemini-2.5-pro (via their respective SDKs) for comparison.
import os
import asyncio
from openai import AsyncOpenAI
from nltk.translate.bleu_score import sentence_bleu

async def measure_bleu():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Who wrote 'Pride and Prejudice'?"
    reference = "Jane Austen wrote Pride and Prejudice."
    # AsyncOpenAI exposes the same .create() method, awaited
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.choices[0].message.content.strip()
    print("Model output:", output)
    # Tokenize for BLEU
    reference_tokens = [reference.split()]
    output_tokens = output.split()
    bleu_score = sentence_bleu(reference_tokens, output_tokens)
    print(f"BLEU score: {bleu_score:.3f}")

asyncio.run(measure_bleu())

output
Model output: Jane Austen is the author of Pride and Prejudice.
BLEU score: 0.750
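The batch evaluation that async calls enable can be sketched with asyncio.gather. Here score_pair is a hypothetical stand-in that skips the model call and computes a local unigram-overlap score, so the concurrency pattern itself is runnable offline; in practice you would await the AsyncOpenAI client inside it:

```python
import asyncio

async def score_pair(prompt: str, reference: str) -> float:
    # Stand-in for "call the model, then score the answer": the API
    # call is skipped and a unigram-overlap score is computed locally.
    await asyncio.sleep(0)  # yield control, as a real API call would
    output = reference  # pretend the model answered with the reference
    ref_tokens, out_tokens = set(reference.split()), set(output.split())
    return len(ref_tokens & out_tokens) / len(ref_tokens)

async def evaluate_batch(pairs):
    # asyncio.gather runs all scoring coroutines concurrently
    return await asyncio.gather(*(score_pair(p, r) for p, r in pairs))

pairs = [
    ("Where is the Eiffel Tower located?", "The Eiffel Tower is in Paris, France."),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen wrote Pride and Prejudice."),
]
scores = asyncio.run(evaluate_batch(pairs))
print(scores)  # [1.0, 1.0]
```

With real API calls, consider bounding concurrency (for example with asyncio.Semaphore) to stay within rate limits.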
Troubleshooting
- If ROUGE or BLEU scores are unexpectedly low, verify that the reference and output texts are comparable in length and content.
- If API calls fail, check your OPENAI_API_KEY environment variable and network connectivity.
- For inconsistent model outputs, try setting temperature=0 in the API call to reduce randomness.
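One way to apply temperature=0 consistently across evaluation calls is to centralize the request arguments. faithfulness_request below is a hypothetical helper; its dict can be splatted into client.chat.completions.create(**kwargs):

```python
def faithfulness_request(prompt: str, model: str = "gpt-4o") -> dict:
    """Keyword arguments for a deterministic evaluation call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding reduces run-to-run variance
    }

kwargs = faithfulness_request("Where is the Eiffel Tower located?")
print(kwargs["temperature"])  # 0
```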
Key Takeaways
- Use automated metrics like ROUGE and BLEU to quantitatively measure LLM faithfulness against reference texts.
- Leverage Python libraries and OpenAI SDK v1+ for easy integration and evaluation.
- Set temperature=0 to improve output consistency during faithfulness testing.
- Async API calls enable scalable batch evaluation of multiple prompts and references.
- Always compare outputs to verified ground truth to accurately assess factual correctness.