High severity · Intermediate · Fix: 5-10 min

AssertionError

builtins.AssertionError

What this error means
A pytest assertion that compares LLM output against an exact expected string fails intermittently because the model's output is nondeterministic, which makes the test flaky.

Stack trace

traceback
_____________________________ test_llm_response _____________________________

    def test_llm_response():
        response = llm.generate(prompt)
>       assert response.text == expected_text
E       AssertionError: assert 'Hello, world!' == 'Hello world!'
E         - Hello, world!
E         + Hello world!

test_llm.py:15: AssertionError
QUICK FIX
Set temperature=0 in LLM calls and normalize outputs before asserting equality in pytest.

Why it happens

LLMs produce probabilistic outputs that can vary slightly on each call, causing exact string assertions in tests to fail intermittently. This nondeterminism is common when prompts are not tightly controlled or when the model sampling parameters allow variability.
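
The effect of the sampling parameters can be illustrated with a toy model. The sketch below is not how any particular provider implements sampling; the `sample_token` helper and the two-entry `logits` table are hypothetical and only show why a temperature of 0 collapses to the single most likely output while higher temperatures let alternatives through.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Sample one candidate from toy logits; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Dividing by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(score - top) for score in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores for two near-identical completions.
logits = {"Hello world!": 2.0, "Hello, world!": 1.5}
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(50)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)   # only the highest-scoring completion
print(sampled)  # both completions show up over many draws
```

Real hosted models can still differ between runs at temperature 0, so greedy settings reduce flakiness rather than eliminate it.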

Detection

Monitor test runs for intermittent assertion failures on LLM outputs and log raw responses to identify variability patterns before flaky failures block CI pipelines.
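
One way to log raw responses is to append them to a JSON Lines file and count the distinct variants per test. This is a minimal sketch; the helper names `log_response` and `response_variants`, the record fields, and the file layout are assumptions made for illustration, not part of pytest or the openai package.

```python
import json
from collections import Counter
from pathlib import Path


def log_response(log_path, test_name, prompt, text):
    """Append one raw LLM response as a JSON line so variants can be compared later."""
    record = {"test": test_name, "prompt": prompt, "text": text}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def response_variants(log_path, test_name):
    """Count how often each distinct raw response was seen for one test."""
    counts = Counter()
    with Path(log_path).open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            if record["test"] == test_name:
                counts[record["text"]] += 1
    return counts
```

Calling `log_response` right after each LLM call in a test, then inspecting `response_variants` across several CI runs, shows whether the failures come from punctuation, casing, or genuinely different content.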

Causes & fixes

1

LLM outputs vary slightly between calls due to sampling randomness

✓ Fix

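Set temperature=0 in the LLM call, leaving top_p at its default of 1, to reduce output variability during tests. Hosted models can still return slightly different text at these settings, so pair this with tolerant assertions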

2

Test asserts exact string match on LLM output without normalization

✓ Fix

Normalize outputs by stripping whitespace, lowercasing, or using regex to allow minor variations in test assertions

3

Prompt instructions are ambiguous or incomplete, causing inconsistent LLM responses

✓ Fix

Refine prompts to explicitly specify output format and content to improve response consistency

4

No retry or tolerance mechanism in test for transient LLM output differences

✓ Fix

Implement retry logic with a few attempts or fuzzy matching in tests to tolerate minor output fluctuations
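
Fixes 2 and 4 above can be sketched with the standard library alone. The `normalize` and `similar` helpers and the 0.9 threshold are illustrative assumptions; tune the normalization and threshold to what your test actually needs to guarantee.

```python
import re
import string
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace runs."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def similar(actual, expected, threshold=0.9):
    """Return True when the normalized strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize(actual), normalize(expected)).ratio()
    return ratio >= threshold


# The two strings from the stack trace differ only by a comma.
assert normalize("Hello, world!") == normalize("Hello world!")
assert similar("Hello, world! ", "hello world")
```

Normalization handles known, harmless differences; fuzzy matching is broader and can hide real regressions if the threshold is set too low.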

Code: broken vs fixed

Broken - triggers the error
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "Say hello"
expected_text = "Hello world!"

response = client.chat.completions.create(model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}])

# This assertion is flaky due to output variability
assert response.choices[0].message.content == expected_text
Fixed - works correctly
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "Say hello"
expected_text = "hello world!"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # Added to reduce randomness
    top_p=1
)

output = response.choices[0].message.content.strip().lower()  # Normalize output
assert output == expected_text  # Fixed flaky assertion
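
Adding temperature=0 and top_p=1 reduces sampling randomness, and normalizing the output string before the assertion removes case and surrounding-whitespace differences. Neither step guarantees identical output on every run, and this normalization keeps punctuation, so a reply of "Hello, world!" would still fail against "hello world!".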

Workaround

Wrap the assertion in a try/except block and retry the LLM call up to 3 times before failing the test to mitigate transient output differences.
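
A minimal sketch of that workaround follows. `retry_assertion` is a hypothetical helper, `generate` stands in for whatever function makes the LLM call, and `check` is any function that raises AssertionError on an unacceptable output.

```python
def retry_assertion(generate, check, attempts=3):
    """Call `generate` up to `attempts` times and return the first output `check` accepts."""
    last_error = None
    for _ in range(attempts):
        output = generate()
        try:
            check(output)  # expected to raise AssertionError on a mismatch
            return output
        except AssertionError as exc:
            last_error = exc
    # Every attempt failed: surface the last mismatch as the cause.
    raise AssertionError(f"assertion failed after {attempts} attempts") from last_error
```

In a test this might look like `retry_assertion(lambda: call_llm(prompt), check_greeting)`, where `call_llm` and `check_greeting` are your own functions. Retries raise API cost and can mask a real regression, so treat this as a stopgap rather than a fix.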

Prevention

Design tests to use deterministic LLM parameters and validate outputs with flexible matching or schema validation instead of exact string equality to avoid flaky failures.
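
As a rough illustration of validating shape instead of exact wording, the sketch below assumes the prompt asks the model to reply with a JSON object such as `{"greeting": "..."}`. That format, the `greeting` field, and the `validate_greeting` helper are hypothetical, and the hand-written checks stand in for a full schema library.

```python
import json


def validate_greeting(raw):
    """Parse a JSON reply and check its shape rather than its exact wording."""
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    assert isinstance(data, dict), "reply must be a JSON object"
    assert isinstance(data.get("greeting"), str), "missing string field 'greeting'"
    assert "hello" in data["greeting"].lower(), "greeting should contain 'hello'"
    return data


# Both phrasings pass because only the structure and a key term are checked.
validate_greeting('{"greeting": "Hello, world!"}')
validate_greeting('{"greeting": "hello world"}')
```

Checks like these stay stable across punctuation and casing changes while still failing when the reply is malformed or off-topic.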

Python 3.9+ · openai >=1.0.0 · tested on 1.x
Verified 2026-04 · gpt-4o-mini, claude-3-5-haiku-20241022