Dataset creation for regression testing
Why this matters
LLM outputs are non-deterministic by design, but your chain logic isn't. Without regression datasets, you can't prove a refactor didn't break quality, introduce latency regressions, or change behavior in subtle ways. At scale, this becomes a production liability.
Explanation
What it is: A regression dataset for LLM chains is a versioned collection of (input, expected_output_signature, execution_trace) tuples captured from a known-good version of your chain. The signature isn't the exact text: it's the structure, length, presence of required fields, and deterministic outputs (like routing decisions or SQL queries).
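As a minimal sketch of what one such tuple could look like in code: the names `OutputSignature`, `RegressionEntry`, `build_entry`, and the `route` key are hypothetical, chosen for illustration, and are not part of LangChain.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class OutputSignature:
    """Structural facts about an output, not its exact text (hypothetical schema)."""
    word_count: int
    required_fields: List[str]               # fields the parsed output must contain
    routing_decision: Optional[str] = None   # deterministic branch taken, if any


@dataclass
class RegressionEntry:
    """One (input, expected_output_signature, execution_trace) tuple."""
    input: dict
    signature: OutputSignature
    trace: dict = field(default_factory=dict)


def build_entry(input_data: dict, parsed_output: dict, raw_text: str, trace: dict) -> RegressionEntry:
    """Reduce a known-good run to the parts that are expected to stay stable."""
    signature = OutputSignature(
        word_count=len(raw_text.split()),
        required_fields=sorted(parsed_output.keys()),
        routing_decision=parsed_output.get("route"),  # "route" is an assumed key
    )
    return RegressionEntry(input=input_data, signature=signature, trace=trace)


entry = build_entry(
    input_data={"ticket": "System is completely down"},
    parsed_output={"severity": "HIGH", "route": "pager"},
    raw_text="HIGH severity, page the on-call engineer",
    trace={"model": "gpt-4o", "temperature": 0},
)
print(json.dumps(asdict(entry), indent=2))
```

Dataclasses keep the schema explicit, and `asdict` makes the entry directly serializable to JSON alongside a version tag.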
How it works mechanically: You run your chain against a curated set of inputs (edge cases, happy paths, boundary conditions) with a fixed seed and temperature=0, recording both the LLM response and any deterministic intermediate outputs (parsed fields, tool calls, validation results). You serialize this to JSON with a version tag. On each code change, you re-run the same inputs and check whether the signature still matches: different LLM outputs are fine, but a routing decision changing or a required field disappearing is a regression. Use RunnablePassthrough or custom callbacks to capture the full execution trace without modifying chain logic.
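The distinction between tolerated variance and a real regression can be sketched as a comparison function. This is an illustrative sketch only: the signature keys (`routing_decision`, `required_fields`, `word_count`) and the 20% tolerance are assumptions, not a fixed schema.

```python
from typing import List


def compare_signatures(baseline: dict, current: dict, length_tolerance: float = 0.2) -> List[str]:
    """Return human-readable regressions; an empty list means the contract held."""
    problems = []

    # Deterministic outputs (routing, tool choice, SQL) must match exactly.
    old_route = baseline.get("routing_decision")
    new_route = current.get("routing_decision")
    if old_route != new_route:
        problems.append(f"routing changed: {old_route!r} -> {new_route!r}")

    # Required fields may be added, but none may disappear.
    missing = set(baseline.get("required_fields", [])) - set(current.get("required_fields", []))
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")

    # Free text only needs to stay inside a length band.
    old_len = baseline.get("word_count", 0)
    new_len = current.get("word_count", 0)
    if old_len > 0 and abs(new_len - old_len) / old_len > length_tolerance:
        problems.append(f"word count drifted: {old_len} -> {new_len}")

    return problems


baseline = {"routing_decision": "billing", "required_fields": ["id", "severity"], "word_count": 40}
reworded = {"routing_decision": "billing", "required_fields": ["id", "severity"], "word_count": 44}
broken = {"routing_decision": "support", "required_fields": ["id"], "word_count": 41}

print(compare_signatures(baseline, reworded))  # different wording, same contract
print(compare_signatures(baseline, broken))    # routing and a required field changed
```

Returning a list of problems rather than a boolean makes a failing run easier to triage, since each entry names the part of the contract that moved.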
When to use it: Once your chain has a stable API contract, which is typically after the first major deployment. Run it on every refactor of prompt templates, tool selection, output parsing, or validation logic.
Analogy
It's like snapshot testing in frontend development. You're not testing exact pixel-perfect output (LLM variance is expected), you're testing the contract: does the component still render a button, still return an ID field, still navigate on click. If the contract breaks, the test fails immediately.
Code
import json
import hashlib
from datetime import datetime
from typing import Any
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

class RegressionDatasetCapture:
    """Records structural signatures of chain outputs rather than exact text."""

    def __init__(self, dataset_path: str, version: str):
        self.dataset_path = dataset_path
        self.version = version
        self.captured = []

    def capture(self, input_data: dict, output: str, trace: dict) -> None:
        signature = {
            "output_length": len(output.split()),  # word count, not characters
            "contains_json": "{" in output,
            "first_50_chars": output[:50] if output else None,
            # Stored as a sorted list: a set is not JSON-serializable and would make save() fail
            "trace_keys": sorted(trace.keys()),
        }
        entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_data,
            "output_signature": signature,
            "trace_keys": list(trace.keys()),
            # Stable identifier for the input (not used for security)
            "input_hash": hashlib.md5(json.dumps(input_data, sort_keys=True).encode()).hexdigest(),
        }
        self.captured.append(entry)

    def save(self) -> None:
        dataset = {
            "version": self.version,
            "created": datetime.now().isoformat(),
            "entries_count": len(self.captured),
            "entries": self.captured,
        }
        with open(self.dataset_path, 'w') as f:
            json.dump(dataset, f, indent=2)

    def load(self) -> dict:
        with open(self.dataset_path, 'r') as f:
            return json.load(f)

# Build a chain with temperature=0 to minimize run-to-run variance
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Classify this support ticket severity: {ticket}\n"
    "Return only: HIGH, MEDIUM, or LOW"
)
chain = prompt | llm | StrOutputParser()

# Capture phase: run once with a baseline, save the dataset
test_inputs = [
    {"ticket": "System is completely down, production bleeding"},
    {"ticket": "Dashboard loads slowly on Thursdays"},
    {"ticket": "Typo in help text on page 4"},
]
capturer = RegressionDatasetCapture(
    dataset_path="regression_dataset.json",
    version="1.0.0"
)
for test_input in test_inputs:
    result = chain.invoke(test_input)
    trace = {"model": "gpt-4o", "temperature": 0, "chain_version": "1.0.0"}
    capturer.capture(test_input, result, trace)
capturer.save()
print(f"Captured {len(capturer.captured)} test cases to regression_dataset.json")

# Regression test phase: load saved dataset and validate structure
loaded = capturer.load()
print(f"\nLoaded regression dataset version {loaded['version']} with {loaded['entries_count']} entries")

# Run chain again and validate signatures match
print("\nRegression check:")
all_pass = True
for idx, entry in enumerate(loaded['entries']):
    new_result = chain.invoke(entry['input'])
    # Rebuild the trace for this run the same way the capture phase did, so the
    # key comparison below checks the new run instead of the saved entry against itself
    new_trace = {"model": "gpt-4o", "temperature": 0, "chain_version": "1.0.0"}
    new_sig = {
        "output_length": len(new_result.split()),
        "contains_json": new_result.count('{') > 0,
        "first_50_chars": new_result[:50] if new_result else None,
        "trace_keys": set(new_trace.keys())
    }
    # Lenient signature check: allow output length variance (±20% for LLM behavior)
    old_len = entry['output_signature']['output_length']
    new_len = new_sig['output_length']
    length_match = abs(new_len - old_len) / old_len <= 0.2 if old_len > 0 else True
    json_match = entry['output_signature']['contains_json'] == new_sig['contains_json']
    keys_match = set(entry['trace_keys']) == new_sig['trace_keys']
    test_pass = length_match and json_match and keys_match
    all_pass = all_pass and test_pass
    status = "✓ PASS" if test_pass else "✗ FAIL"
    print(f"  [{status}] Test {idx+1}: input_hash={entry['input_hash'][:8]}")
print(f"\nOverall: {'All regression tests passed' if all_pass else 'Regression detected - investigation needed'}")

Output:
Captured 3 test cases to regression_dataset.json

Loaded regression dataset version 1.0.0 with 3 entries

Regression check:
  [✓ PASS] Test 1: input_hash=f5c3d2a1
  [✓ PASS] Test 2: input_hash=8e2b4c7f
  [✓ PASS] Test 3: input_hash=d9a1e5b3

Overall: All regression tests passed
What just happened?
The code defined a RegressionDatasetCapture class that records the structural signature of chain outputs (word count, presence of JSON, trace keys) rather than exact text. It ran a chain with three test inputs, captured their signatures and trace metadata to JSON with a version tag, then reloaded that dataset and re-ran the same inputs. On the second run, it validated that the new outputs had structurally compatible signatures, allowing for LLM variance (±20% length change) while catching breaking changes like missing trace keys or structural format shifts. All three tests passed because the chain logic and model behavior remained stable.
Common gotcha
Developers capture datasets with temperature=0 (correctly), then run regression tests against a live chain that falls back to a higher default temperature (for example 1.0) without realizing they've changed the variance baseline. The regression check then fails on variance that is expected at that temperature. Always freeze temperature and model version in both the capture and test phases, or explicitly allow a wider variance tolerance in your signature matching logic.
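One way to guard against this, sketched under the assumption that the dataset stores the settings it was captured with: the `run_config` key, the `RUN_CONFIG` constant, and `assert_same_config` are hypothetical names, and the dataset format shown in the code above does not include them.

```python
RUN_CONFIG = {"model": "gpt-4o", "temperature": 0}  # hypothetical pinned settings


def assert_same_config(dataset: dict, live_config: dict) -> None:
    """Refuse to run regression checks if capture and test settings differ."""
    captured = dataset.get("run_config", {})
    mismatched = {
        key: (captured.get(key), live_config.get(key))
        for key in set(captured) | set(live_config)
        if captured.get(key) != live_config.get(key)
    }
    if mismatched:
        raise ValueError(f"Run config differs from capture: {mismatched}")


# The capture phase would write run_config into the dataset alongside the entries.
dataset = {"version": "1.0.0", "run_config": dict(RUN_CONFIG), "entries": []}

assert_same_config(dataset, RUN_CONFIG)  # same settings: passes silently

try:
    # Simulates a live chain that was built without pinning temperature.
    assert_same_config(dataset, {"model": "gpt-4o", "temperature": 1.0})
except ValueError as exc:
    print(exc)
```

Failing fast on a config mismatch separates "the settings changed" from "the chain regressed", which are otherwise indistinguishable in the test report.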
Error recovery
FileNotFoundError
KeyError when accessing entry['trace_keys']
Regression detected on unmodified code
Experienced dev note
Most teams skip regression datasets because they think 'LLMs are non-deterministic, so testing is pointless.' Wrong. The LLM output variance is fine: it's the routing logic, parsing, validation, and tool selection that must be stable. You're not testing if the model says exactly the same thing; you're testing if your chain still extracts the required fields, still calls the right tools, still rejects invalid inputs the same way. Capture only the deterministic parts of the trace. The second insight: version your datasets like you version your database schema. When you refactor a prompt, save the old dataset as regression_v1.json and generate regression_v2.json. Keep the old one: it's proof of what changed and lets you audit behavior evolution over time.
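Keeping both dataset versions side by side makes the behavior change auditable. A rough sketch, assuming entries carry the `input_hash` and `output_signature` fields used in the code above; `dataset_path` and `diff_datasets` are hypothetical helpers, and the entries are made-up sample data.

```python
import json
import tempfile
from pathlib import Path


def dataset_path(directory: Path, major: int) -> Path:
    """File naming follows the regression_v1.json / regression_v2.json pattern."""
    return directory / f"regression_v{major}.json"


def diff_datasets(old: dict, new: dict) -> dict:
    """Report which inputs were added, removed, or changed signature."""
    old_by_hash = {e["input_hash"]: e["output_signature"] for e in old["entries"]}
    new_by_hash = {e["input_hash"]: e["output_signature"] for e in new["entries"]}
    shared = set(old_by_hash) & set(new_by_hash)
    return {
        "added": sorted(set(new_by_hash) - set(old_by_hash)),
        "removed": sorted(set(old_by_hash) - set(new_by_hash)),
        "changed": sorted(h for h in shared if old_by_hash[h] != new_by_hash[h]),
    }


v1 = {"version": "1.0.0", "entries": [
    {"input_hash": "aaa", "output_signature": {"output_length": 1, "contains_json": False}},
    {"input_hash": "bbb", "output_signature": {"output_length": 1, "contains_json": False}},
]}
v2 = {"version": "2.0.0", "entries": [
    {"input_hash": "aaa", "output_signature": {"output_length": 1, "contains_json": False}},
    {"input_hash": "bbb", "output_signature": {"output_length": 12, "contains_json": True}},
    {"input_hash": "ccc", "output_signature": {"output_length": 1, "contains_json": False}},
]}

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    # Write v2 next to v1 instead of overwriting it.
    dataset_path(directory, 1).write_text(json.dumps(v1, indent=2))
    dataset_path(directory, 2).write_text(json.dumps(v2, indent=2))
    old = json.loads(dataset_path(directory, 1).read_text())
    new = json.loads(dataset_path(directory, 2).read_text())
    report = diff_datasets(old, new)

print(report)
```

The diff is keyed on the input hash, so it stays meaningful even when entries are reordered between versions.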
Check your understanding
You refactored your prompt template to be more concise. Your regression test passes on signature (word count ±20%, JSON presence, trace keys all match). Why is this not evidence that your change was safe? What else would you need to validate?
Answer hint
A passing signature check only validates structural stability, not semantic correctness. You'd need to spot-check the actual outputs for business logic errors (e.g., did the model still classify severity correctly, even if it used different words?) and validate against labeled ground truth data. Signature regression testing catches implementation breaks; it doesn't catch quality degradation.
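A complementary ground-truth check might look like the sketch below. The labels in `LABELED` are illustrative assumptions, and `stub_classifier` is a deliberately flawed stand-in for `chain.invoke` so the example runs offline.

```python
from typing import Callable, List, Tuple


def label_accuracy(classify: Callable[[str], str], labeled: List[Tuple[str, str]]) -> float:
    """Fraction of labeled tickets whose predicted severity matches the label."""
    if not labeled:
        raise ValueError("labeled set is empty")
    correct = sum(
        1 for ticket, expected in labeled
        if classify(ticket).strip().upper() == expected
    )
    return correct / len(labeled)


# Hypothetical hand-labeled examples; the labels are assumptions for illustration.
LABELED = [
    ("System is completely down, production bleeding", "HIGH"),
    ("Dashboard loads slowly on Thursdays", "MEDIUM"),
    ("Typo in help text on page 4", "LOW"),
]


def stub_classifier(ticket: str) -> str:
    """Stand-in for the chain: always returns one word, but not always the right one."""
    return "HIGH" if "down" in ticket else "LOW"


accuracy = label_accuracy(stub_classifier, LABELED)
print(f"accuracy={accuracy:.2f}")
```

Every output from the stub is a single word, so a word-count signature would pass on all three inputs, yet the accuracy figure exposes the misclassified MEDIUM ticket. In practice the callable would wrap the real chain, for example a function that calls `chain.invoke({"ticket": ticket})`.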
Version note: on older LangChain releases where invoke() was not yet the standard API, chain.run() is the equivalent call, though it is no longer recommended. LangChain 1.2.x strictly requires LCEL chains and invoke(). If you're on an older version, refactor to LCEL first.