How-to · Intermediate · 3 min read

Fix flaky LLM tests

Quick answer
To fix flaky LLM tests, set temperature=0 in your API calls so outputs are as close to deterministic as the API allows (temperature 0 greatly reduces, but does not fully guarantee, run-to-run variation), write stable, unambiguous prompts, and add retries with exponential backoff to absorb transient API errors and network issues.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the latest openai Python SDK and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
output
Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (x.y.z)

Step by step

Use temperature=0 to get near-deterministic completions, and add a retry mechanism with exponential backoff to handle transient failures.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

PROMPT = "Translate the following English sentence to French: 'Hello, how are you?'"

# Retry wrapper with exponential backoff

def call_llm_with_retries(prompt, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,  # minimize sampling variance
            )
            return response.choices[0].message.content.strip()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

if __name__ == "__main__":
    translation = call_llm_with_retries(PROMPT)
    print(f"Translation: {translation}")
output
Translation: Bonjour, comment ça va ?
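
The retry logic above can also be factored into a reusable decorator so every test helper shares the same backoff policy. A minimal sketch; the `with_retries` name and parameters are illustrative, not part of the OpenAI SDK (the fake `flaky` function stands in for a real API call):

```python
import time
import functools

def with_retries(max_retries=3, base_delay=1.0):
    """Retry a function with exponential backoff: base_delay, 2x, 4x, ..."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of retries; let the test fail loudly
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

# Example: a fake flaky call that fails twice, then succeeds.
calls = {"n": 0}

@with_retries(max_retries=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(flaky())  # succeeds on the third attempt and prints "ok"
```

Keeping the retry policy in one decorator means a change to the backoff schedule applies to every test at once.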

Common variations

  • Use temperature=0 for reproducible outputs; raise it only when a test deliberately exercises varied completions.
  • For asynchronous test suites, use the AsyncOpenAI client with the same retry logic.
  • Test against smaller models like gpt-4o-mini for faster, cheaper runs.
python
import os
import asyncio
from openai import AsyncOpenAI

# The synchronous OpenAI client cannot be awaited; use AsyncOpenAI instead.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_call_llm():
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Say hello in Spanish."}],
        temperature=0
    )
    print("Async response:", response.choices[0].message.content.strip())

if __name__ == "__main__":
    asyncio.run(async_call_llm())
output
Async response: Hola

Troubleshooting

  • If tests fail intermittently, verify network stability and API rate limits.
  • Check for prompt changes or model updates that may affect output.
  • Use snapshot testing with normalized outputs to reduce false negatives.
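
Snapshot testing with normalized outputs, from the last bullet, can be as simple as canonicalizing both sides before comparing. A sketch; the `normalize` helper and the stored snapshot string are illustrative:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so cosmetic
    differences (casing, spacing, trailing punctuation) don't fail tests."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

# Stored snapshot (captured from a known-good run) vs. a fresh model output.
SNAPSHOT = "bonjour comment ça va"
fresh_output = "Bonjour,  comment ça va ?"

assert normalize(fresh_output) == SNAPSHOT
print("snapshot match")
```

Comparing normalized strings means the test only fails on meaningful content changes, not on punctuation or whitespace drift between runs.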

Key Takeaways

  • Set temperature=0 to minimize output variance and reduce test flakiness.
  • Implement retries with backoff to handle transient API errors or network issues.
  • Use stable, clear prompts and snapshot testing to catch meaningful changes only.
Verified 2026-04 · gpt-4o-mini