How to measure LLM faithfulness
Quick answer
Measure LLM faithfulness by comparing the model's generated outputs against verified ground truth using metrics like ROUGE, BLEU, and factual consistency checks. Use Python libraries and AI APIs to automate evaluation with reference texts and prompt engineering.
Prerequisites
- Python 3.8+
- An OpenAI API key (free tier works)
- pip install openai>=1.0
- pip install rouge-score
- pip install nltk
Setup
Install required Python packages and set your OpenAI API key as an environment variable.
- Install packages: openai for API calls, rouge-score and nltk for evaluation metrics.
- Set the environment variable OPENAI_API_KEY to your API key.

pip install openai rouge-score nltk

output
Collecting openai
Collecting rouge-score
Collecting nltk
Successfully installed openai rouge-score nltk
Step by step
This example shows how to generate a model completion with gpt-4o and measure faithfulness by comparing the output to a reference answer using rouge-score.
import os
from openai import OpenAI
from rouge_score import rouge_scorer
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Reference factual text
reference = "The Eiffel Tower is located in Paris, France."
# Prompt to generate a factual statement
prompt = "Where is the Eiffel Tower located?"
# Generate completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
output = response.choices[0].message.content.strip()
print("Model output:", output)
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
# Score model output against reference
scores = scorer.score(reference, output)
print(f"ROUGE-L F1 score: {scores['rougeL'].fmeasure:.3f}")

output
Model output: The Eiffel Tower is in Paris, France.
ROUGE-L F1 score: 0.923
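ROUGE-L is driven by the longest common subsequence (LCS) between the reference and the candidate. As a minimal, dependency-free sketch of that computation (no stemming or tokenizer normalization, so its values differ slightly from rouge-score's), the F1 can be derived directly:

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 from the LCS of whitespace tokens (no stemming)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for longest common subsequence length
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # LCS share of the candidate
    recall = lcs / len(ref)       # LCS share of the reference
    return 2 * precision * recall / (precision + recall)

print(round(rouge_l_f1("The Eiffel Tower is located in Paris, France.",
                       "The Eiffel Tower is in Paris, France."), 3))  # 0.933
```

The small gap to rouge-score's 0.923 comes from its extra tokenization and stemming steps.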
Common variations
You can measure faithfulness with other metrics like BLEU using nltk, or perform factual consistency checks by prompting the model to verify its own output.
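A factual consistency check can be sketched as an LLM-as-judge prompt. The helper names below (build_check_prompt, parse_verdict) are hypothetical, and only the prompt construction and verdict parsing are shown; the actual chat-completion call that would answer the prompt is omitted:

```python
def build_check_prompt(reference: str, output: str) -> str:
    """Ask a judge model whether the output is supported by the reference."""
    return (
        "Reference: " + reference + "\n"
        "Claim: " + output + "\n"
        "Does the reference fully support the claim? Answer YES or NO."
    )

def parse_verdict(answer: str) -> bool:
    """Treat a leading YES (case-insensitive) as a pass."""
    return answer.strip().upper().startswith("YES")

prompt = build_check_prompt(
    "The Eiffel Tower is located in Paris, France.",
    "The Eiffel Tower is in Paris, France.",
)
# A judge model's reply would be parsed like this:
print(parse_verdict("YES, the claim is supported."))  # True
```

Constraining the judge to YES/NO keeps the parsing trivial; for graded faithfulness you could instead ask for a 1-5 score and parse the integer.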
For async usage, use asyncio with the OpenAI SDK's AsyncOpenAI client. You can also test other models like claude-3-5-sonnet-20241022 or gemini-2.5-pro (via their respective SDKs) for comparison.
import os
import asyncio
from openai import AsyncOpenAI
from nltk.translate.bleu_score import sentence_bleu

async def measure_bleu():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    prompt = "Who wrote 'Pride and Prejudice'?"
    reference = "Jane Austen wrote Pride and Prejudice."
    # AsyncOpenAI exposes the same .create() method, awaited
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    output = response.choices[0].message.content.strip()
    print("Model output:", output)
    # Tokenize for BLEU
    reference_tokens = [reference.split()]
    output_tokens = output.split()
    bleu_score = sentence_bleu(reference_tokens, output_tokens)
    print(f"BLEU score: {bleu_score:.3f}")

asyncio.run(measure_bleu())

output
Model output: Jane Austen is the author of Pride and Prejudice.
BLEU score: 0.750
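The batch evaluation that async calls enable can be sketched with asyncio.gather. Here score_pair is a hypothetical stand-in that skips the model call and computes a local unigram-overlap score, so the concurrency pattern itself is runnable offline; in practice you would await the AsyncOpenAI client inside it:

```python
import asyncio

async def score_pair(prompt: str, reference: str) -> float:
    # Stand-in for "call the model, then score the answer": the API
    # call is skipped and a unigram-overlap score is computed locally.
    await asyncio.sleep(0)  # yield control, as a real API call would
    output = reference  # pretend the model answered with the reference
    ref_tokens, out_tokens = set(reference.split()), set(output.split())
    return len(ref_tokens & out_tokens) / len(ref_tokens)

async def evaluate_batch(pairs):
    # asyncio.gather runs all scoring coroutines concurrently
    return await asyncio.gather(*(score_pair(p, r) for p, r in pairs))

pairs = [
    ("Where is the Eiffel Tower located?", "The Eiffel Tower is in Paris, France."),
    ("Who wrote 'Pride and Prejudice'?", "Jane Austen wrote Pride and Prejudice."),
]
scores = asyncio.run(evaluate_batch(pairs))
print(scores)  # [1.0, 1.0]
```

With real API calls, consider bounding concurrency (for example with asyncio.Semaphore) to stay within rate limits.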
Troubleshooting
- If ROUGE or BLEU scores are unexpectedly low, verify that the reference and output texts are comparable in length and content.
- If API calls fail, check your OPENAI_API_KEY environment variable and network connectivity.
- For inconsistent model outputs, try setting temperature=0 in the API call to reduce randomness.
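One way to apply temperature=0 consistently across evaluation calls is to centralize the request arguments. faithfulness_request below is a hypothetical helper; its dict can be splatted into client.chat.completions.create(**kwargs):

```python
def faithfulness_request(prompt: str, model: str = "gpt-4o") -> dict:
    """Keyword arguments for a deterministic evaluation call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # greedy decoding reduces run-to-run variance
    }

kwargs = faithfulness_request("Where is the Eiffel Tower located?")
print(kwargs["temperature"])  # 0
```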
Key Takeaways
- Use automated metrics like ROUGE and BLEU to quantitatively measure LLM faithfulness against reference texts.
- Leverage Python libraries and OpenAI SDK v1+ for easy integration and evaluation.
- Set temperature=0 to improve output consistency during faithfulness testing.
- Async API calls enable scalable batch evaluation of multiple prompts and references.
- Always compare outputs to verified ground truth to accurately assess factual correctness.