How to detect LLM output degradation
Quick answer
Detect LLM output degradation by monitoring automated metrics such as perplexity, BLEU, or ROUGE over time, combined with human evaluation for relevance and coherence. Automated tests that compare outputs against baseline responses help catch quality drops early.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup environment
Install the openai Python SDK and set your API key as an environment variable to interact with the model for output sampling and evaluation.
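Assuming a POSIX shell, the key can be exported like this (the key value below is a placeholder, not a real key):

```shell
# Make the key visible to the SDK, which reads it from the environment
export OPENAI_API_KEY="your-key-here"
```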
pip install "openai>=1.0"
Step by step detection
Generate outputs for a fixed set of prompts periodically, then compute automated metrics such as perplexity or similarity against the baseline outputs (the example below uses TF-IDF cosine similarity as a cheap proxy for semantic similarity). Flag degradation when a metric worsens beyond a set threshold.
import os
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Baseline prompts and outputs (collected previously)
baseline_prompts = [
"Explain the concept of recursion.",
"What is the capital of France?",
"Summarize the plot of Romeo and Juliet."
]
baseline_outputs = [
"Recursion is a method where the solution to a problem depends on solutions to smaller instances of the same problem.",
"The capital of France is Paris.",
"Romeo and Juliet is a tragedy about two young star-crossed lovers whose deaths ultimately reconcile their feuding families."
]
# Function to get the model's current output for a prompt
def get_llm_output(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
# Generate current outputs
current_outputs = [get_llm_output(p) for p in baseline_prompts]
# Compute TF-IDF cosine similarity between baseline and current outputs
# (a lexical proxy for semantic similarity)
vectorizer = TfidfVectorizer().fit(baseline_outputs + current_outputs)
baseline_vecs = vectorizer.transform(baseline_outputs)
current_vecs = vectorizer.transform(current_outputs)
similarities = [cosine_similarity(baseline_vecs[i], current_vecs[i])[0][0] for i in range(len(baseline_outputs))]
# Threshold for degradation detection
threshold = 0.7
degraded = [sim < threshold for sim in similarities]
for i, is_degraded in enumerate(degraded):
    print(f"Prompt: {baseline_prompts[i]}")
    print(f"Similarity: {similarities[i]:.2f}")
    print("Degradation detected." if is_degraded else "Output quality OK.")
    print("---")
Output
Prompt: Explain the concept of recursion.
Similarity: 0.85
Output quality OK.
---
Prompt: What is the capital of France?
Similarity: 0.92
Output quality OK.
---
Prompt: Summarize the plot of Romeo and Juliet.
Similarity: 0.65
Degradation detected.
---
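The quick answer also mentions perplexity. The OpenAI chat API can return per-token log-probabilities when you pass `logprobs=True`; perplexity is then just the exponential of the mean negative log-probability. The metric itself is a pure function, sketched here over a plain list of logprobs (the sample values are illustrative, not real API output):

```python
import math
from typing import List

def perplexity_from_logprobs(logprobs: List[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token)."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# With the OpenAI SDK the values would come from, e.g.:
#   response = client.chat.completions.create(..., logprobs=True)
#   lps = [t.logprob for t in response.choices[0].logprobs.content]
sample = [-0.1, -0.5, -2.3, -0.05]  # illustrative per-token logprobs
print(f"{perplexity_from_logprobs(sample):.2f}")
```

Rising perplexity on a fixed prompt set is a useful degradation signal to track alongside similarity.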
Common variations
You can use asynchronous calls to speed up batch queries, or switch to another model such as claude-3-5-haiku-20241022 via the Anthropic SDK's async client. Streaming outputs help detect partial degradation early. Also integrate human review for nuanced quality checks.
import asyncio
import os
import anthropic

# The async client is required for awaitable calls
client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def get_output_async(prompt: str) -> str:
    message = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    # content is a list of blocks; take the text of the first one
    return message.content[0].text.strip()
async def main():
    prompts = [
        "Explain the concept of recursion.",
        "What is the capital of France?",
        "Summarize the plot of Romeo and Juliet."
    ]
    results = await asyncio.gather(*(get_output_async(p) for p in prompts))
    for prompt, output in zip(prompts, results):
        print(f"Prompt: {prompt}\nOutput: {output}\n---")

asyncio.run(main())
Output
Prompt: Explain the concept of recursion.
Output: Recursion is a technique where a function calls itself to solve smaller instances of a problem.
---
Prompt: What is the capital of France?
Output: The capital of France is Paris.
---
Prompt: Summarize the plot of Romeo and Juliet.
Output: Romeo and Juliet is a tragic story of two young lovers whose deaths end their families' feud.
---
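The variation above notes that streaming helps catch partial degradation early. The check itself can be a pure function over incoming text chunks, so it works with any provider's stream; this sketch flags a run of identical consecutive chunks (a common repetition-loop failure mode) without waiting for the full response — a plain list stands in for a real stream here:

```python
from typing import Iterable

def detect_repetition(chunks: Iterable[str], max_repeats: int = 3) -> bool:
    """Return True as soon as the same chunk arrives max_repeats times in a row."""
    previous, run = None, 0
    for chunk in chunks:
        run = run + 1 if chunk == previous else 1
        previous = chunk
        if run >= max_repeats:
            return True  # bail out early instead of consuming the whole stream
    return False

# Stand-ins for chunks arriving from a streaming response
print(detect_repetition(["The", " capital", " is", " Paris", "."]))      # healthy
print(detect_repetition(["The", " same", " same", " same", " answer"]))  # degraded
```

Because the function returns on the first bad run, you can abort the request and retry instead of paying for (and logging) a fully degraded completion.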
Troubleshooting tips
- If similarity scores are unexpectedly low, verify your baseline outputs are accurate and representative.
- Ensure consistent prompt phrasing to avoid false degradation signals.
- Check API rate limits or errors that might cause incomplete outputs.
- Combine automated metrics with human review to catch subtle quality drops.
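For the rate-limit tip above, a small retry wrapper with exponential backoff keeps transient failures from producing incomplete outputs that pollute the metrics. This is a generic sketch (the broad `except` and the delay schedule are assumptions; real SDKs raise their own specific rate-limit exceptions you would catch instead):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(), retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated flaky call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_retries(flaky, sleep=lambda s: None))  # succeeds on the third call
```

The injectable `sleep` parameter also makes the wrapper easy to test without real delays.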
Key takeaways
- Use fixed prompt sets and baseline outputs to monitor LLM output quality over time.
- Automated semantic similarity metrics quickly flag potential degradation.
- Combine automated checks with human evaluation for reliable detection.
- Leverage async calls and streaming for efficient and early degradation detection.