How-to · Intermediate · 3 min read

Fix flaky LLM tests

Quick answer
To fix flaky LLM tests, set temperature=0 in your API calls so outputs are as close to deterministic as the API allows (temperature 0 greatly reduces, but does not fully guarantee, run-to-run variation), write stable, unambiguous prompts, and add retries with exponential backoff to absorb transient API errors and network issues.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the latest openai Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z)

Step by step

Use temperature=0 to get near-deterministic completions, and add a retry mechanism with exponential backoff to handle transient failures.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

PROMPT = "Translate the following English sentence to French: 'Hello, how are you?'"

# Retry wrapper with exponential backoff

def call_llm_with_retries(prompt, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,  # minimize sampling variance
            )
            return response.choices[0].message.content.strip()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

if __name__ == "__main__":
    translation = call_llm_with_retries(PROMPT)
    print(f"Translation: {translation}")
output
Translation: Bonjour, comment ça va ?
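
The retry logic above can also be factored into a reusable decorator so every test helper shares the same backoff policy. A minimal sketch; the `with_retries` name and parameters are illustrative, not part of the OpenAI SDK (the fake `flaky` function stands in for a real API call):

```python
import time
import functools

def with_retries(max_retries=3, base_delay=1.0):
    """Retry a function with exponential backoff: base_delay, 2x, 4x, ..."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries; let the test fail loudly
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

# Example: a fake flaky call that fails twice, then succeeds.
calls = {"n": 0}

@with_retries(max_retries=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky())  # succeeds on the third attempt and prints "ok"
```

Keeping the retry policy in one decorator means a change to the backoff schedule applies to every test at once.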

Common variations

  • Use temperature=0 for reproducible outputs; raise it only when a test deliberately exercises varied completions.
  • For asynchronous test suites, use the AsyncOpenAI client with the same retry logic.
  • Test against smaller models like gpt-4o-mini for faster, cheaper runs.
python
import os
import asyncio
from openai import AsyncOpenAI

# The synchronous OpenAI client cannot be awaited; use AsyncOpenAI instead.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_call_llm():
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say hello in Spanish."}],
        temperature=0
    )
    print("Async response:", response.choices[0].message.content.strip())

if __name__ == "__main__":
    asyncio.run(async_call_llm())
output
Async response: Hola

Troubleshooting

  • If tests fail intermittently, verify network stability and API rate limits.
  • Check for prompt changes or model updates that may affect output.
  • Use snapshot testing with normalized outputs to reduce false negatives.
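
Snapshot testing with normalized outputs, from the last bullet, can be as simple as canonicalizing both sides before comparing. A sketch; the `normalize` helper and the stored snapshot string are illustrative:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so cosmetic
    differences (casing, spacing, trailing punctuation) don't fail tests."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Stored snapshot (captured from a known-good run) vs. a fresh model output.
SNAPSHOT = "bonjour comment ça va"
fresh_output = "Bonjour,  comment ça va ?"

assert normalize(fresh_output) == SNAPSHOT
print("snapshot match")
```

Comparing normalized strings means the test only fails on meaningful content changes, not on punctuation or whitespace drift between runs.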

Key Takeaways

  • Set temperature=0 to minimize output variance and reduce test flakiness.
  • Implement retries with backoff to handle transient API errors or network issues.
  • Use stable, clear prompts and snapshot testing to catch meaningful changes only.
Verified 2026-04 · gpt-4o-mini