How to build a prompt regression test suite
Quick answer
Build a prompt regression test suite by saving expected outputs for key prompts and automatically comparing new LLM responses against them using code. Use OpenAI or Anthropic SDKs to generate outputs, then assert consistency to catch regressions.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable for secure access.
pip install "openai>=1.0"
Step by step
Create a Python script that defines a set of prompts with their expected outputs. Use the OpenAI SDK to generate current outputs and compare them to the saved expected results. Flag any differences as regressions.
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define test cases with prompts and expected outputs
prompt_tests = {
    "greeting": {
        "prompt": "Say hello in a friendly way.",
        "expected": "Hello! How can I help you today?",
    },
    "farewell": {
        "prompt": "Say goodbye politely.",
        "expected": "Goodbye! Have a great day!",
    },
}
def run_regression_tests():
    failures = []
    for test_name, test_data in prompt_tests.items():
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": test_data["prompt"]}],
        )
        output = response.choices[0].message.content.strip()
        if output != test_data["expected"]:
            failures.append((test_name, output, test_data["expected"]))
    if failures:
        print("Regression test failures detected:")
        for name, got, expected in failures:
            print(f"- {name}:\n  Got: {got}\n  Expected: {expected}\n")
    else:
        print("All regression tests passed.")

if __name__ == "__main__":
    run_regression_tests()
Output
All regression tests passed.
Common variations
You can extend the suite by adding async calls, switching to other models (for example Anthropic's claude-3-5-haiku-20241022 via the Anthropic SDK), or integrating with CI pipelines. For async, use asyncio with the SDK's AsyncOpenAI client. For streaming, accumulate the streamed chunks and compare the full text after completion.
import os
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def run_async_test(prompt, expected):
    # AsyncOpenAI exposes the same create() method; it just needs to be awaited.
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content.strip()
    assert output == expected, f"Expected '{expected}', got '{output}'"

async def main():
    await run_async_test("Say hello in a friendly way.", "Hello! How can I help you today?")

if __name__ == "__main__":
    asyncio.run(main())
Troubleshooting
- If tests fail due to minor output changes, set temperature=0 for more deterministic outputs, or use fuzzy matching or semantic similarity instead of exact string match.
- If API rate limits occur, add retries with exponential backoff.
- Ensure environment variables are set correctly to avoid authentication errors.
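The retry advice above can be sketched with a small stdlib-only wrapper. This is a hedged illustration: the function name with_retries, the default attempt count, and the jitter formula are choices made here, not part of the OpenAI SDK.

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            # scaled by up to 2x random jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

You would wrap each API call, e.g. with_retries(lambda: client.chat.completions.create(...)). In production you may want to retry only on rate-limit errors rather than all exceptions.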
Key Takeaways
- Save expected outputs for key prompts to detect unintended changes in LLM behavior.
- Automate prompt testing with SDK calls and assert output consistency in your CI pipeline.
- Use exact or fuzzy matching depending on your tolerance for minor output variations.
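For tolerance of minor output variations, a minimal fuzzy-match helper can be built on Python's standard-library difflib. The function name fuzzy_match and the 0.8 threshold are illustrative assumptions; tune the threshold to your prompts.

```python
from difflib import SequenceMatcher

def fuzzy_match(got: str, expected: str, threshold: float = 0.8) -> bool:
    """Return True when the two outputs are similar enough, tolerating
    small wording drift (punctuation, casing) that exact matching rejects."""
    ratio = SequenceMatcher(None, got.lower(), expected.lower()).ratio()
    return ratio >= threshold
```

Replacing the exact comparison in run_regression_tests() with fuzzy_match(output, test_data["expected"]) keeps the suite useful even when the model rephrases slightly; for deeper paraphrases, embedding-based semantic similarity is the next step up.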