How to build a prompt regression test suite
Quick answer
Build a prompt regression test suite by saving expected outputs for key prompts and automatically comparing new LLM responses against them using code. Use OpenAI or Anthropic SDKs to generate outputs, then assert consistency to catch regressions.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable for secure access.
pip install "openai>=1.0"
Step by step
Create a Python script that defines a set of prompts with their expected outputs. Use the OpenAI SDK to generate current outputs and compare them to the saved expected results. Flag any differences as regressions.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define test cases with prompts and expected outputs
prompt_tests = {
    "greeting": {
        "prompt": "Say hello in a friendly way.",
        "expected": "Hello! How can I help you today?",
    },
    "farewell": {
        "prompt": "Say goodbye politely.",
        "expected": "Goodbye! Have a great day!",
    },
}
def run_regression_tests():
    failures = []
    for test_name, test_data in prompt_tests.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": test_data["prompt"]}],
        )
        output = response.choices[0].message.content.strip()
        if output != test_data["expected"]:
            failures.append((test_name, output, test_data["expected"]))
    if failures:
        print("Regression test failures detected:")
        for name, got, expected in failures:
            print(f"- {name}:\n  Got: {got}\n  Expected: {expected}\n")
    else:
        print("All regression tests passed.")

if __name__ == "__main__":
    run_regression_tests()
Output
All regression tests passed.
Common variations
You can extend the suite by adding async calls, switching to other models (for example Anthropic's claude-3-5-haiku-20241022 via the Anthropic SDK), or integrating with CI pipelines. For async, use asyncio with the SDK's AsyncOpenAI client. For streaming, accumulate the streamed chunks and compare the full text after completion.
import os
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_async_test(prompt, expected):
    # AsyncOpenAI exposes the same create() method; it just needs to be awaited.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()
    assert output == expected, f"Expected '{expected}', got '{output}'"

async def main():
    await run_async_test("Say hello in a friendly way.", "Hello! How can I help you today?")

if __name__ == "__main__":
    asyncio.run(main())
Troubleshooting
- If tests fail due to minor output changes, set temperature=0 for more deterministic outputs, or use fuzzy matching or semantic similarity instead of exact string match.
- If API rate limits occur, add retries with exponential backoff.
- Ensure environment variables are set correctly to avoid authentication errors.
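The retry advice above can be sketched with a small stdlib-only wrapper. This is a hedged illustration: the function name with_retries, the default attempt count, and the jitter formula are choices made here, not part of the OpenAI SDK.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            # scaled by up to 2x random jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

You would wrap each API call, e.g. with_retries(lambda: client.chat.completions.create(...)). In production you may want to retry only on rate-limit errors rather than all exceptions.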
Key Takeaways
- Save expected outputs for key prompts to detect unintended changes in LLM behavior.
- Automate prompt testing with SDK calls and assert output consistency in your CI pipeline.
- Use exact or fuzzy matching depending on your tolerance for minor output variations.
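For tolerance of minor output variations, a minimal fuzzy-match helper can be built on Python's standard-library difflib. The function name fuzzy_match and the 0.8 threshold are illustrative assumptions; tune the threshold to your prompts.

```python
from difflib import SequenceMatcher

def fuzzy_match(got: str, expected: str, threshold: float = 0.8) -> bool:
    """Return True when the two outputs are similar enough, tolerating
    small wording drift (punctuation, casing) that exact matching rejects."""
    ratio = SequenceMatcher(None, got.lower(), expected.lower()).ratio()
    return ratio >= threshold
```

Replacing the exact comparison in run_regression_tests() with fuzzy_match(output, test_data["expected"]) keeps the suite useful even when the model rephrases slightly; for deeper paraphrases, embedding-based semantic similarity is the next step up.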