High severity · Intermediate · Fix: 5-10 min

LLMEvaluationMetricNotDeterministicError

ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError

What this error means

The evaluation metric produced different scores for the same input across runs because the underlying LLM returned non-deterministic outputs.

Stack trace

traceback

Traceback (most recent call last):
  File "/app/ai_testing/evaluation.py", line 120, in run_evaluation
    results = evaluate(model_outputs)
  File "/app/ai_testing/evaluation.py", line 87, in evaluate
    raise LLMEvaluationMetricNotDeterministicError("Metric results vary between runs")
ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError: Evaluation metric returned inconsistent results due to non-deterministic LLM outputs

QUICK FIX

Set temperature=0 on the LLM call, pass a fixed request seed if the API supports one, and fix local random seeds before evaluation so outputs are as reproducible as possible.

Why it happens

LLM evaluation metrics require consistent outputs to produce stable scores. When the underlying LLM generates different outputs for the same input due to randomness or temperature settings, the evaluation metric results vary, triggering this error.

Detection

Monitor evaluation metric variance across repeated runs on the same input; large fluctuations indicate non-deterministic outputs causing metric instability.
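
One way to put this check into practice is to score the same input several times and look at the spread. The sketch below assumes a hypothetical `score_fn` callable that wraps whatever metric you use; `measure_variance`, `is_stable`, and the `tolerance` threshold are illustrative names, not part of ai-testing.

```python
import statistics
from typing import Callable, List


def measure_variance(score_fn: Callable[[str], float], text: str, runs: int = 5) -> dict:
    """Score the same input `runs` times and summarize the spread."""
    scores: List[float] = [score_fn(text) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "pstdev": statistics.pstdev(scores),  # population std dev; 0.0 means identical runs
        "range": max(scores) - min(scores),
    }


def is_stable(report: dict, tolerance: float = 1e-6) -> bool:
    """Treat the metric as stable when all runs agree within `tolerance`."""
    return report["range"] <= tolerance
```

A non-zero `range` on identical input points at the model or the metric itself rather than at the data. Pick a `tolerance` that matches the precision you expect from your metric.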

Causes & fixes

LLM temperature or other sampling parameters allow random sampling, so the same prompt yields different outputs

✓ Fix

Set temperature to 0 during evaluation, or use a deterministic decoding method such as greedy decoding, which always picks the highest-probability token and so makes top_p irrelevant.
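
To see why temperature matters, consider the toy sampler below. It is not a real model and the logits are made up; it only illustrates the mechanism. Logits are divided by the temperature before the softmax, so as the temperature approaches 0 the distribution collapses onto the highest-scoring token, which is the token greedy decoding picks directly.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Greedy (argmax) when temperature is 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

With `temperature=0` the function never consults the random generator, so repeated calls return the same index. With a higher temperature, different calls can return different indices.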

Evaluation code does not fix random seeds for reproducibility

✓ Fix

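Set random seeds for all relevant libraries (e.g., numpy, torch, random) before running evaluation. This makes locally run models and local sampling steps reproducible; it does not affect a hosted API, where you need a request-level seed if the provider offers one.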
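
Seeding controls randomness in your own process, for example when the evaluation samples a subset of test cases. Below is a minimal sketch using only the standard library; `sample_eval_cases` is a hypothetical helper, not an ai-testing function.

```python
import random
from typing import List, Sequence


def sample_eval_cases(cases: Sequence[str], k: int, seed: int = 42) -> List[str]:
    """Pick k evaluation cases reproducibly using a dedicated seeded generator."""
    rng = random.Random(seed)  # local generator avoids mutating global random state
    return rng.sample(list(cases), k)
```

Using a dedicated `random.Random(seed)` instance rather than `random.seed()` keeps the selection reproducible even if other code draws from the global generator in between.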

Using a model version or API endpoint that does not guarantee deterministic outputs

✓ Fix

Switch to a model or API configuration that supports deterministic output generation or use cached outputs for evaluation.
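
Caching can be sketched as follows, assuming a hypothetical `generate` callable that wraps your model call. Outputs are stored in a JSON file keyed by a hash of the model name, prompt, and parameters, so repeated evaluations score exactly the same text. All names here are illustrative.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict


def cache_key(model: str, prompt: str, params: Dict[str, object]) -> str:
    """Stable key: hash of the request serialized with sorted keys."""
    payload = json.dumps({"model": model, "prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(
    generate: Callable[[str, str, Dict[str, object]], str],
    model: str,
    prompt: str,
    params: Dict[str, object],
    cache_path: Path,
) -> str:
    """Return the cached output if present; otherwise call `generate` and store the result."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = cache_key(model, prompt, params)
    if key not in cache:
        cache[key] = generate(model, prompt, params)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[key]
```

Because the parameters are part of the key, changing the model or the temperature produces a fresh entry instead of silently reusing an old output.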

Code: broken vs fixed

Broken - triggers the error

python

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Evaluate this text."}],
    temperature=0.7  # causes non-deterministic outputs
)

# This leads to inconsistent evaluation metric results

Fixed - works correctly

python

import os
import random

import numpy as np
import torch
from openai import OpenAI

# Fix seeds for local randomness (data sampling, local models).
# These do not influence sampling inside the hosted API.
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Evaluate this text."}],
    temperature=0.0,  # removes sampling randomness
    seed=42,  # request-level seed; reproducibility is best-effort on the server side
)

print(response.choices[0].message.content)  # should now be stable across runs

Setting temperature to 0 removes sampling randomness, and fixing seeds makes any remaining randomness reproducible, which stabilizes evaluation metric results. Hosted APIs may still show occasional variation between runs.

⚠ Workaround

Catch LLMEvaluationMetricNotDeterministicError, rerun the evaluation several times, and average the results as a temporary way to reduce variance.
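
Here is a minimal sketch of this workaround. Because the ai-testing API is not shown on this page, the exception class is a local stand-in, and `run_evaluation` and `score_once` are hypothetical callables. In real code you would import the exception from `ai_testing.evaluation`, as named in the error.

```python
import statistics
from typing import Callable


class LLMEvaluationMetricNotDeterministicError(Exception):
    """Local stand-in for ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError."""


def evaluate_with_fallback(
    run_evaluation: Callable[[], float],
    score_once: Callable[[], float],
    reruns: int = 5,
) -> float:
    """Try the strict evaluation; if it reports instability, average several single runs."""
    try:
        return run_evaluation()
    except LLMEvaluationMetricNotDeterministicError:
        scores = [score_once() for _ in range(reruns)]
        return statistics.mean(scores)
```

Averaging hides the variance rather than removing it, so treat the returned score as approximate until one of the fixes above is in place.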

✓ Prevention

Always configure LLMs with deterministic decoding parameters and fix random seeds during evaluation so that metric results are stable and reproducible.

Python 3.9+ · ai-testing >=1.0.0 · tested on 1.2.0

Verified 2026-04
