How to detect LLM output degradation
Quick answer
Detect LLM output degradation by monitoring automated metrics such as perplexity, BLEU, or ROUGE over time, combined with human evaluation for relevance and coherence. Automated tests that compare outputs against baseline responses help catch quality drops early.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup environment
Install the openai Python SDK and set your API key as an environment variable to interact with the model for output sampling and evaluation.
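Assuming a POSIX shell, the key can be exported like this (the key value below is a placeholder, not a real key):

```shell
# Make the key visible to the SDK, which reads it from the environment
export OPENAI_API_KEY="your-key-here"
```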
pip install "openai>=1.0"
Step by step detection
Generate outputs for a fixed set of prompts periodically, then compute automated metrics such as perplexity or similarity against the baseline outputs (the example below uses TF-IDF cosine similarity as a cheap proxy for semantic similarity). Flag degradation when a metric worsens beyond a set threshold.
import os
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Baseline prompts and outputs (collected previously)
baseline_prompts = [
"Explain the concept of recursion.",
"What is the capital of France?",
"Summarize the plot of Romeo and Juliet."
]
baseline_outputs = [
"Recursion is a method where the solution to a problem depends on solutions to smaller instances of the same problem.",
"The capital of France is Paris.",
"Romeo and Juliet is a tragedy about two young star-crossed lovers whose deaths ultimately reconcile their feuding families."
]
# Function to get the model's current output for a prompt
def get_llm_output(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
# Generate current outputs
current_outputs = [get_llm_output(p) for p in baseline_prompts]
# Compute TF-IDF cosine similarity between baseline and current outputs
# (a lexical proxy for semantic similarity)
vectorizer = TfidfVectorizer().fit(baseline_outputs + current_outputs)
baseline_vecs = vectorizer.transform(baseline_outputs)
current_vecs = vectorizer.transform(current_outputs)
similarities = [cosine_similarity(baseline_vecs[i], current_vecs[i])[0][0] for i in range(len(baseline_outputs))]
# Threshold for degradation detection
threshold = 0.7
degraded = [sim < threshold for sim in similarities]
for i, is_degraded in enumerate(degraded):
    print(f"Prompt: {baseline_prompts[i]}")
    print(f"Similarity: {similarities[i]:.2f}")
    print("Degradation detected." if is_degraded else "Output quality OK.")
    print("---")
Output
Prompt: Explain the concept of recursion.
Similarity: 0.85
Output quality OK.
---
Prompt: What is the capital of France?
Similarity: 0.92
Output quality OK.
---
Prompt: Summarize the plot of Romeo and Juliet.
Similarity: 0.65
Degradation detected.
---
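The quick answer also mentions perplexity. The OpenAI chat API can return per-token log-probabilities when you pass `logprobs=True`; perplexity is then just the exponential of the mean negative log-probability. The metric itself is a pure function, sketched here over a plain list of logprobs (the sample values are illustrative, not real API output):

```python
import math
from typing import List

def perplexity_from_logprobs(logprobs: List[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token)."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(-sum(logprobs) / len(logprobs))

# With the OpenAI SDK the values would come from, e.g.:
#   response = client.chat.completions.create(..., logprobs=True)
#   lps = [t.logprob for t in response.choices[0].logprobs.content]
sample = [-0.1, -0.5, -2.3, -0.05]  # illustrative per-token logprobs
print(f"{perplexity_from_logprobs(sample):.2f}")
```

Rising perplexity on a fixed prompt set is a useful degradation signal to track alongside similarity.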
Common variations
You can use asynchronous calls to speed up batch queries, or switch to another model such as claude-3-5-haiku-20241022 via the Anthropic SDK's async client. Streaming outputs help detect partial degradation early. Also integrate human review for nuanced quality checks.
import asyncio
import os
import anthropic

# The async client is required for awaitable calls
client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

async def get_output_async(prompt: str) -> str:
    message = await client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        system="You are a helpful assistant.",
        messages=[{"role": "user", "content": prompt}],
    )
    # content is a list of blocks; take the text of the first one
    return message.content[0].text.strip()
async def main():
    prompts = [
        "Explain the concept of recursion.",
        "What is the capital of France?",
        "Summarize the plot of Romeo and Juliet."
    ]
    results = await asyncio.gather(*(get_output_async(p) for p in prompts))
    for prompt, output in zip(prompts, results):
        print(f"Prompt: {prompt}\nOutput: {output}\n---")

asyncio.run(main())
Output
Prompt: Explain the concept of recursion.
Output: Recursion is a technique where a function calls itself to solve smaller instances of a problem.
---
Prompt: What is the capital of France?
Output: The capital of France is Paris.
---
Prompt: Summarize the plot of Romeo and Juliet.
Output: Romeo and Juliet is a tragic story of two young lovers whose deaths end their families' feud.
---
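The variation above notes that streaming helps catch partial degradation early. The check itself can be a pure function over incoming text chunks, so it works with any provider's stream; this sketch flags a run of identical consecutive chunks (a common repetition-loop failure mode) without waiting for the full response — a plain list stands in for a real stream here:

```python
from typing import Iterable

def detect_repetition(chunks: Iterable[str], max_repeats: int = 3) -> bool:
    """Return True as soon as the same chunk arrives max_repeats times in a row."""
    previous, run = None, 0
    for chunk in chunks:
        run = run + 1 if chunk == previous else 1
        previous = chunk
        if run >= max_repeats:
            return True  # bail out early instead of consuming the whole stream
    return False

# Stand-ins for chunks arriving from a streaming response
print(detect_repetition(["The", " capital", " is", " Paris", "."]))      # healthy
print(detect_repetition(["The", " same", " same", " same", " answer"]))  # degraded
```

Because the function returns on the first bad run, you can abort the request and retry instead of paying for (and logging) a fully degraded completion.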
Troubleshooting tips
- If similarity scores are unexpectedly low, verify your baseline outputs are accurate and representative.
- Ensure consistent prompt phrasing to avoid false degradation signals.
- Check API rate limits or errors that might cause incomplete outputs.
- Combine automated metrics with human review to catch subtle quality drops.
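For the rate-limit tip above, a small retry wrapper with exponential backoff keeps transient failures from producing incomplete outputs that pollute the metrics. This is a generic sketch (the broad `except` and the delay schedule are assumptions; real SDKs raise their own specific rate-limit exceptions you would catch instead):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call fn(), retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Simulated flaky call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_retries(flaky, sleep=lambda s: None))  # succeeds on the third call
```

The injectable `sleep` parameter also makes the wrapper easy to test without real delays.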
Key takeaways
- Use fixed prompt sets and baseline outputs to monitor LLM output quality over time.
- Automated semantic similarity metrics quickly flag potential degradation.
- Combine automated checks with human evaluation for reliable detection.
- Leverage async calls and streaming for efficient and early degradation detection.