How-to · Intermediate · 3 min read

Handle non-deterministic test outputs

Quick answer
To handle non-deterministic test outputs in AI testing, use techniques like output normalization, running multiple test iterations, and applying tolerance thresholds to compare results. Employ assert statements with fuzzy matching or statistical checks to accommodate variability in model responses.
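As a minimal, API-free sketch of the fuzzy-assert idea: normalize both strings, then compare with a similarity ratio instead of exact equality.

```python
import difflib

def normalize_text(text: str) -> str:
    """Strip whitespace and lowercase so trivial differences don't fail tests."""
    return text.strip().lower()

# Two responses that differ only in case and surrounding whitespace
a = normalize_text("  Paris  ")
b = normalize_text("paris")

# Fuzzy match rather than exact equality
ratio = difflib.SequenceMatcher(None, a, b).ratio()
assert ratio >= 0.85, f"Outputs too divergent: {ratio:.2f}"
print("similarity:", ratio)  # 1.0 here: the normalized strings are identical
```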

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell does not treat > as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

  • Install OpenAI SDK: pip install openai
  • Set environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)
bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
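Before making any API call, it helps to fail fast when the key is missing rather than hit a confusing auth error later. A small check like this (the helper name is my own) does the job:

```python
import os

def api_key_configured() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    print("API key configured:", api_key_configured())
```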

Step by step

This example demonstrates handling non-deterministic outputs by running multiple completions and comparing normalized results with a tolerance threshold.

python
import os
from openai import OpenAI
import difflib

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Function to normalize output text

def normalize_text(text: str) -> str:
    return text.strip().lower()

# Function to compare two texts with similarity threshold

def is_similar(text1: str, text2: str, threshold: float = 0.85) -> bool:
    ratio = difflib.SequenceMatcher(None, text1, text2).ratio()
    return ratio >= threshold

# Run multiple completions to handle variability

prompts = ["What is the capital of France?"] * 3
outputs = []

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    outputs.append(normalize_text(text))

# Compare outputs pairwise

for i in range(len(outputs)):
    for j in range(i + 1, len(outputs)):
        similar = is_similar(outputs[i], outputs[j])
        print(f"Output {i+1} and Output {j+1} similar? {similar}")

# Assert at least two outputs are similar

pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
assert any(is_similar(outputs[i], outputs[j]) for i, j in pairs), "Outputs are too divergent"
output
Output 1 and Output 2 similar? True
Output 1 and Output 3 similar? True
Output 2 and Output 3 similar? True
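The pairwise check can also be turned into a statistical pass/fail: cluster the runs by fuzzy similarity and require a strict majority to agree. This API-free helper (`majority_output` is my own name for it) reuses the same difflib ratio:

```python
import difflib

def normalize_text(text):
    """Strip whitespace and lowercase before comparing."""
    return text.strip().lower()

def majority_output(outputs, threshold=0.85):
    """Return the output that a strict majority of runs agrees with
    (under fuzzy matching), or None if the runs are too divergent."""
    normalized = [normalize_text(o) for o in outputs]
    best, best_votes = None, 0
    for candidate in set(normalized):
        votes = sum(
            difflib.SequenceMatcher(None, candidate, other).ratio() >= threshold
            for other in normalized
        )
        if votes > best_votes:
            best, best_votes = candidate, votes
    return best if best_votes > len(outputs) / 2 else None

# Three runs agree once normalized; one verbose outlier is outvoted
runs = ["Paris", "paris", " Paris ", "The capital city is Paris."]
print(majority_output(runs))  # paris
```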

Common variations

You can handle non-determinism with these variations:

  • Use async calls for parallel requests to speed up multiple runs.
  • Apply different similarity metrics like Levenshtein distance or embedding cosine similarity.
  • Try a smaller or different model; response speed changes, but output variability is driven mainly by sampling settings such as temperature.
python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip().lower()

async def main():
    prompt = "What is the capital of France?"
    tasks = [get_response(prompt) for _ in range(5)]
    results = await asyncio.gather(*tasks)
    for i, output in enumerate(results, 1):
        print(f"Output {i}: {output}")

if __name__ == "__main__":
    asyncio.run(main())
output
Output 1: paris
Output 2: paris
Output 3: paris
Output 4: paris
Output 5: paris
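For the embedding-based variation, the comparison itself is just cosine similarity between two vectors. A plain-Python sketch follows; in a real test the vectors would come from an embeddings endpoint (e.g. OpenAI's text-embedding-3-small), and the toy vectors here are stand-ins:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of two similar answers
v1 = [0.1, 0.9, 0.2]
v2 = [0.12, 0.88, 0.18]
print(cosine_similarity(v1, v2))
```

Embedding similarity is more robust than character-level matching when two answers are worded differently but mean the same thing.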

Troubleshooting

If outputs vary too much, try these fixes:

  • Lower the temperature parameter to reduce randomness (e.g., temperature=0.0 for near-deterministic output).
  • Use output normalization to ignore whitespace, case, or punctuation differences.
  • Run more iterations to statistically verify expected output patterns.
  • Check API rate limits or errors if responses are inconsistent or missing.
python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.0
)
print(response.choices[0].message.content.strip())
output
Paris
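For rate-limit errors, a small retry-with-backoff wrapper is often enough. This is a generic sketch that retries any callable; with the real SDK you would pass retry_on=(openai.RateLimitError,), the exception class the v1 SDK raises for 429 responses:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the test
            time.sleep(base_delay * 2 ** attempt)

# Usage with the OpenAI client might look like:
#   with_retries(lambda: client.chat.completions.create(...),
#                retry_on=(openai.RateLimitError,))
```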

Key Takeaways

  • Normalize AI outputs before comparison to handle minor variations.
  • Run multiple test iterations and compare results statistically.
  • Set temperature=0.0 for more deterministic AI responses.
  • Use fuzzy matching or similarity metrics to assert test correctness.
  • Async calls speed up multiple test runs for non-deterministic outputs.
Verified 2026-04 · gpt-4o-mini