AssertionError
builtins.AssertionError
Stack trace
_____________________________ test_llm_response _____________________________
    def test_llm_response():
        response = llm.generate(prompt)
>       assert response.text == expected_text
E       AssertionError: assert 'Hello, world!' == 'Hello world!'
E         - Hello, world!
E         + Hello world!
test_llm.py:15: AssertionError
Why it happens
LLMs generate text by sampling tokens from a probability distribution, so the same prompt can return slightly different text on each call, and exact string assertions in tests fail intermittently. The variation is larger when the prompt leaves the output format open or when sampling parameters such as temperature allow it.
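To make the role of sampling parameters concrete, here is a toy sketch of temperature-scaled sampling over two candidate tokens. The function names, tokens, and scores are hypothetical, and this is a simplified illustration of the general idea, not how any particular provider implements decoding.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng):
    """Pick one token at random according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]
```

With scores of 2.0 and 1.0, the first token gets a probability of roughly 0.73 at temperature 1.0, so the second token still appears in about a quarter of calls. At temperature 0.1 the first token's probability exceeds 0.999. That is why a low temperature makes repeated calls agree far more often.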
Detection
Monitor test runs for intermittent assertion failures on LLM outputs and log raw responses to identify variability patterns before flaky failures block CI pipelines.
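One way to apply this is to call the model several times with the same prompt, log each raw response, and count the distinct outputs. The sketch below assumes a hypothetical `generate` callable that takes a prompt and returns a string; the helper names are illustrative and do not come from any library.

```python
import logging
from collections import Counter
from typing import Callable

logger = logging.getLogger("llm_tests")


def sample_outputs(generate: Callable[[str], str], prompt: str, runs: int = 5) -> Counter:
    """Call `generate` several times, log each raw response, and count distinct outputs."""
    counts: Counter = Counter()
    for attempt in range(1, runs + 1):
        raw = generate(prompt)
        # repr() keeps invisible differences (trailing newlines, double spaces) visible in logs
        logger.info("attempt=%d raw=%r", attempt, raw)
        counts[raw] += 1
    return counts


def is_stable(counts: Counter) -> bool:
    """True when every sampled call returned the same string."""
    return len(counts) == 1
```

Running this once against a new prompt, before writing an exact-match assertion, shows whether the output varies and which characters differ.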
Causes & fixes
Cause: LLM outputs vary slightly between calls due to sampling randomness.
Fix: Set deterministic parameters like temperature=0 and top_p=1 in the LLM call to reduce output variability during tests.
Cause: Test asserts exact string match on LLM output without normalization.
Fix: Normalize outputs by stripping whitespace, lowercasing, or using regex to allow minor variations in test assertions.
Cause: Prompt instructions are ambiguous or incomplete, causing inconsistent LLM responses.
Fix: Refine prompts to explicitly specify output format and content to improve response consistency.
Cause: No retry or tolerance mechanism in test for transient LLM output differences.
Fix: Implement retry logic with a few attempts or fuzzy matching in tests to tolerate minor output fluctuations.
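The normalization and fuzzy-matching fixes can be sketched with the standard library alone. The helper names and the 0.9 threshold are assumptions chosen for illustration; pick a threshold that fits how much variation your test should tolerate.

```python
import re
import string
from difflib import SequenceMatcher

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


def roughly_equal(actual: str, expected: str, threshold: float = 0.9) -> bool:
    """Compare normalized strings by similarity ratio instead of exact equality."""
    ratio = SequenceMatcher(None, normalize(actual), normalize(expected)).ratio()
    return ratio >= threshold
```

With these helpers, the two strings from the stack trace ('Hello, world!' and 'Hello world!') normalize to the same text, so `roughly_equal` accepts them while still rejecting an unrelated reply.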
Code: broken vs fixed
# Broken: exact match on raw model output
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
prompt = "Say hello"
expected_text = "Hello world!"
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
# This assertion is flaky due to output variability
assert response.choices[0].message.content == expected_text

# Fixed: explicit prompt, low-variance sampling, normalized comparison
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Spell out the exact wording so the model has less room to improvise
prompt = "Reply with exactly: Hello world!"
expected_text = "hello world!"
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # Reduces sampling randomness; does not guarantee identical output
    top_p=1,
)
output = response.choices[0].message.content.strip().lower()  # Normalize whitespace and case
assert output == expected_text  # Much less flaky, though still not strictly deterministic
Workaround
Wrap the assertion in a try/except block and retry the LLM call up to 3 times before failing the test to mitigate transient output differences.
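A minimal sketch of that retry wrapper follows. `retry_assertion`, `generate`, and `check` are hypothetical names: `generate` stands in for whatever function makes the LLM call, and `check` is any function that raises AssertionError on a mismatch.

```python
from typing import Callable


def retry_assertion(
    generate: Callable[[], str],
    check: Callable[[str], None],
    attempts: int = 3,
) -> str:
    """Call `generate` up to `attempts` times; return the first output that passes `check`."""
    last_error = None
    for _ in range(attempts):
        output = generate()
        try:
            check(output)  # expected to raise AssertionError on mismatch
            return output
        except AssertionError as exc:
            last_error = exc  # remember the failure and try again
    # Every attempt failed: surface the last assertion error as the cause
    raise AssertionError(f"no passing output after {attempts} attempts") from last_error
```

Retries can also hide a genuine regression that only fails some of the time, so it helps to keep the attempt count small and to log each rejected output.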
Prevention
Design tests to use deterministic LLM parameters and validate outputs with flexible matching or schema validation instead of exact string equality to avoid flaky failures.
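As an illustration of schema validation, the sketch below assumes the prompt asks the model to reply with a JSON object containing a `greeting` field. That schema and the function name are invented for this example. The test then checks the shape and key content of the reply rather than its exact wording.

```python
import json


def validate_greeting(raw: str) -> dict:
    """Parse a JSON reply and check its shape rather than its exact wording."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    greeting = data.get("greeting")
    if not isinstance(greeting, str) or not greeting.strip():
        raise ValueError("'greeting' must be a non-empty string")
    if "hello" not in greeting.lower():
        raise ValueError("'greeting' should contain the word 'hello'")
    return data
```

Both `{"greeting": "Hello, world!"}` and `{"greeting": "hello world"}` pass this check, so punctuation and casing differences no longer fail the test, while a missing or empty field still does.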