Code Advanced hard · 8 min

Dataset creation for regression testing

What you will learn

Build deterministic, versioned test datasets that capture LLM chain behavior and detect output drift across updates.

Why this matters

LLM outputs are non-deterministic by design, but your chain logic isn't. Without regression datasets, you can't prove a refactor didn't break quality, introduce latency regressions, or change behavior in subtle ways. At scale, this becomes a production liability.

Skip if: You're building a one-off prototype or a throwaway experiment. Also skip this if your chain has zero state and zero business logic (a pure pass-through to an LLM with no validation), though this is rare in production.

Explanation

What it is: A regression dataset for LLM chains is a versioned collection of (input, expected_output_signature, execution_trace) tuples captured from a known-good version of your chain. The signature isn't the exact text: it's the structure, length, presence of required fields, and deterministic outputs (like routing decisions or SQL queries).
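
To make "signature" concrete, here is a minimal sketch of one way to reduce an output to structural facts. The function name `output_signature`, its `required_fields` parameter, and the sample outputs are assumptions for illustration, not part of any library.

```python
import json

def output_signature(output: str, required_fields: tuple = ()) -> dict:
    """Reduce a chain output to structural facts that should survive LLM rewording."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        parsed = None  # plain-text output, no fields to inspect
    fields = sorted(parsed.keys()) if isinstance(parsed, dict) else []
    return {
        "word_count": len(output.split()),
        "is_json_object": isinstance(parsed, dict),
        "fields": fields,
        "missing_required": sorted(set(required_fields) - set(fields)),
    }

# Two differently worded outputs (hypothetical) with the same structure
a = output_signature('{"severity": "HIGH", "reason": "Site is down"}', ("severity",))
b = output_signature('{"severity": "HIGH", "reason": "Production outage reported"}', ("severity",))
print(a["fields"] == b["fields"], a["missing_required"])
```

The two outputs differ in wording but share the same fields, so a comparison on `fields` and `missing_required` treats them as equivalent.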

How it works mechanically: You run your chain against a curated set of inputs (edge cases, happy paths, boundary conditions) with a fixed seed and temperature=0, recording both the LLM response and any deterministic intermediate outputs (parsed fields, tool calls, validation results). You serialize this to JSON with a version tag. On each code change, you re-run the same inputs and check whether the signature still matches: different LLM outputs are fine, but a routing decision changing or a required field disappearing is a regression. Use RunnablePassthrough or custom callbacks to capture the full execution trace without modifying chain logic.
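
The capture-then-compare loop described above can be sketched without any API calls. Everything here is a stand-in: `fake_llm`, `run_chain`, `capture`, and `check` are hypothetical names, and the routing rule is invented for the example. Only the parsed field and the routing decision are stored, not the raw model text.

```python
import json

def fake_llm(ticket: str) -> str:
    """Stand-in for a model call (hypothetical); real wording may vary between runs."""
    return "HIGH - outage" if "down" in ticket.lower() else "LOW - cosmetic"

def run_chain(ticket: str) -> dict:
    """Return the raw response plus the deterministic intermediate outputs."""
    raw = fake_llm(ticket)
    severity = raw.split()[0]                              # parsed field
    route = "pager" if severity == "HIGH" else "backlog"   # routing decision
    return {"raw": raw, "severity": severity, "route": route}

def capture(inputs: list, version: str) -> str:
    """Run the known-good chain and serialize the deterministic parts with a version tag."""
    entries = []
    for ticket in inputs:
        result = run_chain(ticket)
        expected = {"severity": result["severity"], "route": result["route"]}
        entries.append({"input": ticket, "expected": expected})
    return json.dumps({"version": version, "entries": entries}, indent=2)

def check(dataset_json: str, chain=run_chain) -> list:
    """Re-run every input and list each deterministic value that changed."""
    failures = []
    for entry in json.loads(dataset_json)["entries"]:
        result = chain(entry["input"])
        for key, expected in entry["expected"].items():
            if result[key] != expected:
                failures.append((entry["input"], key, expected, result[key]))
    return failures

baseline = capture(["System is completely down", "Typo in help text"], version="1.0.0")
print(check(baseline))  # empty list while the chain is unchanged
```

Passing a modified chain as the `chain` argument to `check` reports each changed routing decision or parsed field as a tuple of (input, key, expected, actual).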

When to use it: After your chain reaches a stable API contract: typically after the first major deployment. Use this for every refactor of prompt templates, tool selection, output parsing, or validation logic.

Analogy

It's like snapshot testing in frontend development. You're not testing exact pixel-perfect output (LLM variance is expected), you're testing the contract: does the component still render a button, still return an ID field, still navigate on click. If the contract breaks, the test fails immediately.

Code

Illustrative only - not runnable without a valid API key

python

import json
import hashlib
from datetime import datetime
from typing import Any
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

class RegressionDatasetCapture:
    def __init__(self, dataset_path: str, version: str):
        self.dataset_path = dataset_path
        self.version = version
        self.captured = []
    
    def capture(self, input_data: dict, output: str, trace: dict) -> None:
        signature = {
            "output_length": len(output.split()),
            "contains_json": output.count('{') > 0,
            "first_50_chars": output[:50] if output else None,
            "trace_keys": sorted(trace.keys())  # list, not set: sets are not JSON-serializable
        }
        entry = {
            "timestamp": datetime.now().isoformat(),
            "input": input_data,
            "output_signature": signature,
            "trace_keys": list(trace.keys()),
            "input_hash": hashlib.md5(json.dumps(input_data, sort_keys=True).encode()).hexdigest()
        }
        self.captured.append(entry)
    
    def save(self) -> None:
        dataset = {
            "version": self.version,
            "created": datetime.now().isoformat(),
            "entries_count": len(self.captured),
            "entries": self.captured
        }
        with open(self.dataset_path, 'w') as f:
            json.dump(dataset, f, indent=2)
    
    def load(self) -> dict:
        with open(self.dataset_path, 'r') as f:
            return json.load(f)

# Build a chain with temperature=0 for determinism
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Classify this support ticket severity: {ticket}\n"
    "Return only: HIGH, MEDIUM, or LOW"
)
chain = prompt | llm | StrOutputParser()

# Capture phase: run once with a baseline, save the dataset
test_inputs = [
    {"ticket": "System is completely down, production bleeding"},
    {"ticket": "Dashboard loads slowly on Thursdays"},
    {"ticket": "Typo in help text on page 4"},
]

capturer = RegressionDatasetCapture(
    dataset_path="regression_dataset.json",
    version="1.0.0"
)

for test_input in test_inputs:
    result = chain.invoke(test_input)
    trace = {"model": "gpt-4o", "temperature": 0, "chain_version": "1.0.0"}
    capturer.capture(test_input, result, trace)

capturer.save()
print(f"Captured {len(capturer.captured)} test cases to regression_dataset.json")

# Regression test phase: load saved dataset and validate structure
loaded = capturer.load()
print(f"\nLoaded regression dataset version {loaded['version']} with {loaded['entries_count']} entries")

# Run chain again and validate signatures match
print("\nRegression check:")
all_pass = True
for idx, entry in enumerate(loaded['entries']):
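    new_result = chain.invoke(entry['input'])
    # Rebuild the trace for this run; reusing the stored keys would make the
    # trace-key comparison below pass unconditionally
    new_trace = {"model": "gpt-4o", "temperature": 0, "chain_version": "1.0.0"}
    new_sig = {
        "output_length": len(new_result.split()),
        "contains_json": new_result.count('{') > 0,
        "first_50_chars": new_result[:50] if new_result else None,
        "trace_keys": set(new_trace.keys())
    }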
    # Lenient signature check: allow output length variance (±20% for LLM behavior)
    old_len = entry['output_signature']['output_length']
    new_len = new_sig['output_length']
    # An empty baseline only matches an empty new output
    length_match = abs(new_len - old_len) / old_len <= 0.2 if old_len > 0 else new_len == 0
    json_match = entry['output_signature']['contains_json'] == new_sig['contains_json']
    keys_match = set(entry['trace_keys']) == new_sig['trace_keys']
    
    test_pass = length_match and json_match and keys_match
    all_pass = all_pass and test_pass
    status = "✓ PASS" if test_pass else "✗ FAIL"
    print(f"  [{status}] Test {idx+1}: input_hash={entry['input_hash'][:8]}")

print(f"\nOverall: {'All regression tests passed' if all_pass else 'Regression detected - investigation needed'}")

Output

Captured 3 test cases to regression_dataset.json

Loaded regression dataset version 1.0.0 with 3 entries

Regression check:
  [✓ PASS] Test 1: input_hash=f5c3d2a1
  [✓ PASS] Test 2: input_hash=8e2b4c7f
  [✓ PASS] Test 3: input_hash=d9a1e5b3

Overall: All regression tests passed

What just happened?

The code defined a RegressionDatasetCapture class that records the structural signature of chain outputs (word count, presence of JSON, trace keys) rather than exact text. It ran a chain with three test inputs, captured their signatures and trace metadata to JSON with a version tag, then reloaded that dataset and re-ran the same inputs. On the second run, it validated that the new outputs had structurally compatible signatures, allowing for LLM variance (±20% length change) while catching breaking changes like missing trace keys or structural format shifts. All three tests passed because the chain logic and model behavior remained stable.

Common gotcha

Developers capture datasets with temperature=0 (correctly), then run regression tests against a live chain that leaves temperature at its non-zero default (the exact value depends on the library version), without realizing they've changed the variance baseline. The regression check then fails on expected variance. Always freeze temperature and model version in both capture and test phases, or explicitly allow a higher variance tolerance in your signature matching logic.
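
One way to guard against this is to store the generation settings with the dataset and refuse to compare when they drift. This is a minimal sketch. `FROZEN_KEYS`, `config_mismatches`, and the seed value are assumptions for illustration.

```python
FROZEN_KEYS = ("model", "temperature", "seed")

def config_mismatches(baseline_cfg: dict, live_cfg: dict) -> dict:
    """Return {key: (baseline, live)} for every frozen setting that differs."""
    return {
        key: (baseline_cfg.get(key), live_cfg.get(key))
        for key in FROZEN_KEYS
        if baseline_cfg.get(key) != live_cfg.get(key)
    }

# Settings recorded at capture time (seed value is hypothetical)
baseline_cfg = {"model": "gpt-4o", "temperature": 0, "seed": 42}
# Live chain where temperature was never set explicitly
live_cfg = {"model": "gpt-4o", "seed": 42}

mismatches = config_mismatches(baseline_cfg, live_cfg)
if mismatches:
    print(f"Refusing to compare; settings drifted: {mismatches}")
```

A missing key counts as a mismatch, which is the case that matters here: an unset temperature means the library default applies.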

Error recovery

FileNotFoundError

You're running the regression test before creating the initial dataset. Run the capture phase first to generate regression_dataset.json in the working directory.

KeyError when accessing entry['trace_keys']

Your saved dataset was created with a different schema (missing trace_keys field). Regenerate the baseline dataset with the current capturer code.
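
A schema check at load time can turn that KeyError into a readable message. The sketch below assumes the entry fields written by the capturer above. `REQUIRED_ENTRY_KEYS` and `schema_problems` are hypothetical names.

```python
REQUIRED_ENTRY_KEYS = {"input", "output_signature", "trace_keys", "input_hash"}

def schema_problems(dataset: dict) -> list:
    """List human-readable schema problems instead of failing later with KeyError."""
    problems = []
    if "version" not in dataset:
        problems.append("dataset: missing 'version'")
    for idx, entry in enumerate(dataset.get("entries", [])):
        missing = REQUIRED_ENTRY_KEYS - set(entry)
        if missing:
            problems.append(f"entry {idx}: missing {sorted(missing)}")
    return problems

# A dataset saved by an older capturer that did not record trace_keys (made-up values)
old_dataset = {
    "version": "0.9.0",
    "entries": [{"input": {"ticket": "Typo"}, "output_signature": {}, "input_hash": "abc"}],
}
print(schema_problems(old_dataset))
```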

Regression detected on unmodified code

Your chain outputs vary even with temperature=0, likely because the model alias now resolves to a different version or the API itself has residual variance. Pin the exact model checkpoint, or widen the tolerance in the signature check so that expected variance no longer counts as a failure.
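
To pick a tolerance from data rather than by guessing, you can measure how much the word count moves across repeated runs of unchanged code. This is a sketch under assumptions: `observed_length_spread` is a hypothetical helper, and `flaky_chain` is a stand-in that cycles through canned responses instead of calling a model.

```python
from itertools import cycle

def observed_length_spread(run, ticket: str, repeats: int = 5) -> float:
    """Largest relative deviation of word count from the first run."""
    lengths = [len(run(ticket).split()) for _ in range(repeats)]
    base = lengths[0]
    if base == 0:
        return 0.0 if max(lengths) == 0 else float("inf")
    return max(abs(n - base) for n in lengths) / base

# Hypothetical stand-in for a chain whose wording drifts slightly between calls
_responses = cycle(["HIGH: site is down", "HIGH: the site is down", "HIGH: site down"])

def flaky_chain(ticket: str) -> str:
    return next(_responses)

spread = observed_length_spread(flaky_chain, "System is completely down", repeats=6)
print(f"Observed spread: {spread:.0%}")
```

If the observed spread on unchanged code already exceeds the tolerance in your check, the check will fail without any regression, so the tolerance (or the signature itself) needs adjusting.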

Experienced dev note

Most teams skip regression datasets because they think 'LLMs are non-deterministic, so testing is pointless.' Wrong. The LLM output variance is fine: it's the routing logic, parsing, validation, and tool selection that must be stable. You're not testing if the model says exactly the same thing; you're testing if your chain still extracts the required fields, still calls the right tools, still rejects invalid inputs the same way. Capture only the deterministic parts of the trace. The second insight: version your datasets like you version your database schema. When you refactor a prompt, save the old dataset as regression_v1.json and generate regression_v2.json. Keep the old one: it's proof of what changed and lets you audit behavior evolution over time.
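
Keeping both dataset versions makes it possible to audit what changed between them. The sketch below compares two already-loaded datasets keyed by input_hash. `diff_datasets` is a hypothetical helper and the hashes and signatures are made-up placeholders. In practice the two dicts would come from regression_v1.json and regression_v2.json.

```python
def diff_datasets(old: dict, new: dict) -> dict:
    """Summarize which inputs were added, removed, or changed signature between versions."""
    old_by_hash = {e["input_hash"]: e for e in old["entries"]}
    new_by_hash = {e["input_hash"]: e for e in new["entries"]}
    shared = old_by_hash.keys() & new_by_hash.keys()
    return {
        "from": old["version"],
        "to": new["version"],
        "added": sorted(new_by_hash.keys() - old_by_hash.keys()),
        "removed": sorted(old_by_hash.keys() - new_by_hash.keys()),
        "changed": sorted(
            h for h in shared
            if old_by_hash[h]["output_signature"] != new_by_hash[h]["output_signature"]
        ),
    }

# Placeholder datasets standing in for the two saved files
v1 = {"version": "1.0.0", "entries": [
    {"input_hash": "aaa", "output_signature": {"output_length": 1}},
    {"input_hash": "bbb", "output_signature": {"output_length": 1}},
]}
v2 = {"version": "2.0.0", "entries": [
    {"input_hash": "aaa", "output_signature": {"output_length": 3}},
    {"input_hash": "ccc", "output_signature": {"output_length": 1}},
]}
print(diff_datasets(v1, v2))
```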

Check your understanding

You refactored your prompt template to be more concise. Your regression test passes on signature (word count ±20%, JSON presence, trace keys all match). Why is this not evidence that your change was safe? What else would you need to validate?

Answer hint

A passing signature check only validates structural stability, not semantic correctness. You'd need to spot-check the actual outputs for business logic errors (e.g., did the model still classify severity correctly, even if it used different words?) and validate against labeled ground truth data. Signature regression testing catches implementation breaks; it doesn't catch quality degradation.
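
A ground-truth check of the kind the hint describes might look like the sketch below. The labels assigned to the three tickets are assumptions made for the example, and `stub_predict` is a deliberately crude stand-in for the real chain so the example runs without an API key.

```python
def label_accuracy(predict, labeled: list) -> tuple:
    """Return (accuracy, misses), where each miss is (ticket, expected, got)."""
    misses = []
    for ticket, expected in labeled:
        got = predict(ticket).strip().upper()
        if got != expected:
            misses.append((ticket, expected, got))
    accuracy = 1 - len(misses) / len(labeled) if labeled else 0.0
    return accuracy, misses

# Hypothetical labels for the tickets used earlier
labeled = [
    ("System is completely down, production bleeding", "HIGH"),
    ("Dashboard loads slowly on Thursdays", "MEDIUM"),
    ("Typo in help text on page 4", "LOW"),
]

def stub_predict(ticket: str) -> str:
    """Stand-in classifier that only knows HIGH and LOW."""
    return "HIGH" if "down" in ticket else "LOW"

accuracy, misses = label_accuracy(stub_predict, labeled)
print(f"accuracy={accuracy:.2f}, misses={misses}")
```

Every output of the stub is a single word, so a word-count signature passes on all three inputs, yet the labeled check still surfaces the misclassified ticket.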

VERSION In langchain < 0.1.0, invoke() was not yet the standard API; legacy chains were called with chain.run(), which is now deprecated and no longer recommended. LangChain 1.2.x expects LCEL chains and invoke(). If you're on an older version, refactor to LCEL first.

Once your regression dataset is stable and passing, the next challenge is integrating it into CI/CD: running regression checks on every commit using LangSmith's tracing and automated alerting when signatures diverge unexpectedly.
