LLMEvaluationMetricNotDeterministicError
ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError
Stack trace
ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError: Evaluation metric returned inconsistent results due to non-deterministic LLM outputs
  File "/app/ai_testing/evaluation.py", line 87, in evaluate
    raise LLMEvaluationMetricNotDeterministicError("Metric results vary between runs")
  File "/app/ai_testing/evaluation.py", line 120, in run_evaluation
    results = evaluate(model_outputs)
Why it happens
LLM evaluation metrics need consistent model outputs to produce stable scores. When the underlying LLM generates different outputs for the same input because of sampling randomness (for example, a non-zero temperature), the metric scores differ between runs and this error is raised.
Detection
Monitor the variance of each evaluation metric across repeated runs on the same input. Large fluctuations between runs indicate that non-deterministic outputs are destabilizing the metric.
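The check described above can be sketched roughly as follows. This is a minimal illustration, not part of ai_testing: metric_spread, is_unstable, the score_fn callable, and the 0.01 tolerance are all hypothetical names and values chosen for the example.

```python
import statistics
from typing import Callable, List


def metric_spread(score_fn: Callable[[str], float], text: str, runs: int = 5) -> dict:
    """Score the same input several times and summarize how much the score moves.

    score_fn is a hypothetical callable that runs the model and metric once.
    """
    if runs < 2:
        raise ValueError("runs must be at least 2 to measure variance")
    scores: List[float] = [score_fn(text) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "range": max(scores) - min(scores),
    }


def is_unstable(summary: dict, max_stdev: float = 0.01) -> bool:
    """Flag the metric as unstable when the spread exceeds a chosen tolerance.

    The default tolerance is an arbitrary example value, not a recommendation.
    """
    return summary["stdev"] > max_stdev
```

A suitable tolerance depends on the metric's scale, so it is probably best calibrated against a run where the outputs are known to be fixed.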
Causes & fixes
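LLM temperature or other sampling parameters are set to non-zero values, causing output randomness
Set temperature to 0 (optionally with a low top_p), or use a deterministic decoding method such as greedy decoding during evaluation. Hosted APIs may still show small run-to-run differences at temperature 0, so treat this as reducing randomness rather than eliminating it.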
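Evaluation code does not fix random seeds for reproducibility
Set random seeds for all relevant libraries (e.g., random, numpy, torch) before running evaluation. Local seeds make locally run models and any sampling in the evaluation harness reproducible; they do not affect sampling inside a hosted API, which needs its own seed parameter if one is available.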
Using a model version or API endpoint that does not guarantee deterministic outputs
Switch to a model or API configuration that supports deterministic output generation or use cached outputs for evaluation.
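One way to use cached outputs, as suggested in the last fix, is to generate each output once and score the stored copy on every later run. The sketch below assumes a simple JSON file as the cache; OutputCache, get_or_generate, and the generate callable are hypothetical names for this example, not part of ai_testing or any provider SDK.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class OutputCache:
    """Stores one model output per (model, prompt) pair in a JSON file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Load existing entries so repeated evaluation runs reuse the same outputs.
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash the model name and prompt together so each pair gets its own entry.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_generate(self, model: str, prompt: str, generate: Callable[[str], str]) -> str:
        """Return the cached output, calling generate(prompt) only on a cache miss."""
        key = self._key(model, prompt)
        if key not in self.entries:
            self.entries[key] = generate(prompt)
            self.path.write_text(json.dumps(self.entries, indent=2))
        return self.entries[key]
```

Because the metric then always sees the same text, its score cannot vary because of sampling. The trade-off is that the cache has to be cleared whenever the model or prompt template changes in a way the key does not capture.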
Code: broken vs fixed
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Evaluate this text."}],
    temperature=0.7,  # causes non-deterministic outputs
)
# This leads to inconsistent evaluation metric results

import os
import random

import numpy as np
import torch
from openai import OpenAI

# Fix random seeds for reproducibility. These seeds only affect randomness in
# this process (e.g., a local torch model or sampled test cases), not the
# sampling done by the hosted API.
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Evaluate this text."}],
    temperature=0.0,  # minimizes sampling randomness
    seed=42,  # best-effort reproducibility on the API side; not a guarantee
)
print(response.choices[0].message.content)  # far more stable output for evaluation
Workaround
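As a temporary measure, catch the LLMEvaluationMetricNotDeterministicError, rerun the evaluation several times, and average the results to reduce variance. Averaging hides the instability rather than removing it, so apply one of the fixes above as soon as possible.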
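A rough sketch of this workaround follows. It assumes two hypothetical callables: strict_evaluate, which may raise the error, and score_once, which performs a single scoring pass without the consistency check. The exception class here is a local stand-in so the example runs on its own; in real code it would be imported from ai_testing.evaluation.

```python
import statistics
from typing import Callable


class LLMEvaluationMetricNotDeterministicError(Exception):
    """Local stand-in for ai_testing.evaluation.LLMEvaluationMetricNotDeterministicError."""


def evaluate_with_fallback(
    strict_evaluate: Callable[[], float],
    score_once: Callable[[], float],
    runs: int = 5,
) -> float:
    """Return the strict score, or the mean of several single passes if it is rejected.

    Both callables are hypothetical placeholders for the project's own
    evaluation functions.
    """
    try:
        return strict_evaluate()
    except LLMEvaluationMetricNotDeterministicError:
        # Fall back to averaging repeated runs to smooth out the variance.
        scores = [score_once() for _ in range(runs)]
        return statistics.mean(scores)
```

Logging the individual scores alongside the mean may help judge how much instability the average is hiding.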
Prevention
Always configure LLMs with deterministic decoding parameters and fix random seeds during evaluation so that metric results stay as stable and reproducible as the model or API allows.