Handle non-deterministic test outputs
Quick answer
To handle non-deterministic test outputs in AI testing, use techniques like output normalization, running multiple test iterations, and applying tolerance thresholds when comparing results. Employ assert statements with fuzzy matching or statistical checks to accommodate variability in model responses.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
- Install the OpenAI SDK: pip install openai
- Set the environment variable: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows)

Output of pip install openai:

```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example demonstrates handling non-deterministic outputs by running multiple completions and comparing normalized results against a similarity threshold.

```python
import difflib
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def normalize_text(text: str) -> str:
    """Normalize output text so trivial whitespace/case differences are ignored."""
    return text.strip().lower()

def is_similar(text1: str, text2: str, threshold: float = 0.85) -> bool:
    """Return True if the two texts meet the similarity threshold."""
    ratio = difflib.SequenceMatcher(None, text1, text2).ratio()
    return ratio >= threshold

# Run multiple completions to handle variability
prompts = ["What is the capital of France?"] * 3
outputs = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    outputs.append(normalize_text(text))

# Compare outputs pairwise
for i in range(len(outputs)):
    for j in range(i + 1, len(outputs)):
        similar = is_similar(outputs[i], outputs[j])
        print(f"Output {i+1} and Output {j+1} similar? {similar}")

# Assert that at least one pair of outputs is similar
assert any(
    is_similar(outputs[i], outputs[j])
    for i in range(len(outputs))
    for j in range(i + 1, len(outputs))
), "Outputs are too divergent"
```

Output:

```
Output 1 and Output 2 similar? True
Output 1 and Output 3 similar? True
Output 2 and Output 3 similar? True
```
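The normalize_text helper above only strips whitespace and lowercases, so punctuation differences such as a trailing period would still lower the similarity score. A stricter variant is sketched below; the name normalize_strict is illustrative, not part of any SDK.

```python
import re
import string

def normalize_strict(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace.

    A stricter companion to normalize_text for comparing model outputs;
    the function name is illustrative, not from the OpenAI SDK.
    """
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

print(normalize_strict("  The capital is  PARIS! "))  # the capital is paris
```

With this in place, "The capital is Paris." and "the capital is paris" compare as identical before any fuzzy matching is needed.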
Common variations
You can handle non-determinism with these variations:
- Use async calls to issue requests in parallel and speed up multiple runs.
- Apply different similarity metrics, such as Levenshtein distance or embedding cosine similarity.
- Try a different model; smaller models such as gpt-4o-mini make repeated test runs faster and cheaper.
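The Levenshtein-distance bullet can be sketched without extra dependencies. This is a minimal, dependency-free implementation (a drop-in alternative to the difflib-based is_similar above, assuming you normalize the texts first):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance into a 0..1 similarity score."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("kitten", "sitting"))          # 3
print(levenshtein_similarity("paris", "paris"))  # 1.0
```

For production use, a library such as rapidfuzz implements the same metric far more efficiently.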
The async variation looks like this. Note that openai>=1.0 has no acreate method; use the AsyncOpenAI client and await the regular create call instead.

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def get_response(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

async def main():
    prompt = "What is the capital of France?"
    tasks = [get_response(prompt) for _ in range(5)]
    results = await asyncio.gather(*tasks)
    for i, output in enumerate(results, 1):
        print(f"Output {i}: {output}")

if __name__ == "__main__":
    asyncio.run(main())
```

Output:

```
Output 1: paris
Output 2: paris
Output 3: paris
Output 4: paris
Output 5: paris
```
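Once you have several normalized outputs, whether from the parallel runs above or sequential calls, a simple statistical check is to assert that a majority of runs agree. A sketch using collections.Counter, with sample data standing in for real model responses (the helper name majority_output is my own):

```python
from collections import Counter
from typing import List

def majority_output(outputs: List[str], min_fraction: float = 0.6) -> str:
    """Return the most common output if it appears in at least
    min_fraction of runs; otherwise raise AssertionError.
    Outputs are assumed to be normalized already."""
    value, count = Counter(outputs).most_common(1)[0]
    assert count / len(outputs) >= min_fraction, (
        f"No majority: best candidate {value!r} appeared in "
        f"{count}/{len(outputs)} runs"
    )
    return value

# Sample data standing in for five model responses
runs = ["paris", "paris", "paris.", "paris", "paris"]
print(majority_output(runs))  # paris
```

Tolerating a minority of outliers this way is more robust than demanding that every run match exactly.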
Troubleshooting
If outputs vary too much, try these fixes:
- Lower the temperature parameter to reduce randomness (e.g., temperature=0.0 for near-deterministic output).
- Use output normalization to ignore whitespace, case, or punctuation differences.
- Run more iterations to statistically verify expected output patterns.
- Check API rate limits or errors if responses are inconsistent or missing.
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.0,  # reduce sampling randomness
)
print(response.choices[0].message.content.strip())
```

Output:

```
Paris
```
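For the rate-limit bullet above, wrapping calls in a retry with exponential backoff helps keep test runs stable. A minimal sketch: flaky_call simulates a rate-limited endpoint here, and in real code you would catch openai.RateLimitError rather than a generic exception.

```python
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 0.01):
    """Call fn, retrying on exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...

# Simulated flaky endpoint: fails twice, then succeeds
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "Paris"

print(with_retries(flaky_call))  # Paris
```

In practice you would use a larger base_delay (e.g., 1 second); the tiny value here just keeps the demonstration fast.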
Key Takeaways
- Normalize AI outputs before comparison to handle minor variations.
- Run multiple test iterations and compare results statistically.
- Set temperature=0.0 for more deterministic AI responses.
- Use fuzzy matching or similarity metrics to assert test correctness.
- Async calls speed up multiple test runs for non-deterministic outputs.