High severity · Intermediate · Fix: 5-10 min

LLMJudgeInconsistentScoringError

ai_testing.errors.LLMJudgeInconsistentScoringError

What this error means
The LLM judge returned different scores for the same input. This usually points to ambiguous scoring instructions in the prompt or to a mismatch between the expected and the actual scoring logic.

Stack trace

traceback
ai_testing.errors.LLMJudgeInconsistentScoringError: Inconsistent scoring detected: expected score 0.8 but got 0.5 for input ID 12345
  File "/app/ai_testing/judge.py", line 87, in evaluate
    raise LLMJudgeInconsistentScoringError(f"Inconsistent scoring detected: expected score {expected} but got {actual} for input ID {input_id}")
QUICK FIX
Add explicit scoring instructions to the prompt and use an instruction-tuned model so that the LLM judge scores the same input consistently.

Why it happens

This error occurs when the LLM judge produces different scores for the same input across multiple evaluations. It usually happens because the prompt lacks strict instructions to produce deterministic scoring, or the model used is not instruction-tuned to follow scoring guidelines consistently.

Detection

Monitor scoring outputs for repeated inputs and assert score consistency; log discrepancies and raise alerts when scores differ beyond a threshold.
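A minimal sketch of such a consistency check, in plain Python. It does not use the ai_testing API: `score_fn` is a hypothetical stand-in for whatever callable returns a judge score, and the function name, `runs`, and `threshold` defaults are assumptions chosen for illustration.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("judge_consistency")


def find_inconsistent_inputs(
    score_fn: Callable[[str], float],  # hypothetical: any callable that returns a judge score
    inputs: Dict[str, str],            # maps input ID -> input text
    runs: int = 3,
    threshold: float = 0.05,
) -> Dict[str, List[float]]:
    """Score each input `runs` times and return those whose score spread exceeds `threshold`."""
    flagged: Dict[str, List[float]] = {}
    for input_id, text in inputs.items():
        scores = [score_fn(text) for _ in range(runs)]
        spread = max(scores) - min(scores)
        if spread > threshold:
            # Log the discrepancy so it can feed an alert downstream
            logger.warning("Input %s scored inconsistently: %s", input_id, scores)
            flagged[input_id] = scores
    return flagged
```

Running this over a small fixed set of inputs on a schedule, and alerting when the returned dictionary is non-empty, is one way to catch drift before it surfaces as an exception.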

Causes & fixes

1

Prompt instructions are ambiguous or incomplete, causing the LLM to interpret scoring criteria differently each time.

✓ Fix

Revise the prompt to include explicit, unambiguous scoring criteria and examples to guide the LLM judge towards consistent scoring.

2

Using a base LLM model that is not instruction-tuned, leading to variability in output scoring.

✓ Fix

Switch to an instruction-tuned model such as gpt-4o-mini or claude-3-5-haiku-20241022 that reliably follows scoring instructions.

3

The scoring function does not normalize or round scores, causing minor floating-point differences to trigger inconsistency errors.

✓ Fix

Implement score normalization or rounding logic before comparison to tolerate minor floating-point variations.
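The normalization described in fix 3 might look like the sketch below. The helper names `normalize_score` and `scores_match` and the default tolerance of 0.02 are assumptions for illustration, not part of ai_testing. Comparing with a tolerance rather than strict equality also covers scores that land on either side of a rounding boundary.

```python
import math


def normalize_score(raw: float, decimals: int = 2) -> float:
    """Clamp a raw judge score to [0, 1] and round it to a fixed precision."""
    clamped = min(1.0, max(0.0, raw))
    return round(clamped, decimals)


def scores_match(a: float, b: float, tolerance: float = 0.02) -> bool:
    """Treat two scores as consistent when their normalized values differ by at most `tolerance`."""
    return math.isclose(normalize_score(a), normalize_score(b), abs_tol=tolerance)
```

With these defaults, `scores_match(0.794, 0.796)` is treated as consistent even though the two values round to 0.79 and 0.80, while a real disagreement such as 0.8 versus 0.5 is still reported.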

Code: broken vs fixed

Broken - triggers the error
python
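from ai_testing import LLMJudge
from ai_testing.errors import LLMJudgeInconsistentScoringError

input_text = "Sample answer to be graded."  # placeholder input

# Base (non-instruction-tuned) model with no scoring instructions
judge = LLMJudge(model_name="base-llm")
score1 = judge.score(input_text)
score2 = judge.score(input_text)
if score1 != score2:  # exact float comparison, no rounding or tolerance
    raise LLMJudgeInconsistentScoringError(
        f"Inconsistent scoring detected: expected score {score1} but got {score2}"
    )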
Fixed - works correctly
python
import os
from ai_testing import LLMJudge
from ai_testing.errors import LLMJudgeInconsistentScoringError

os.environ["AI_TESTING_API_KEY"] = os.environ["OPENAI_API_KEY"]

input_text = "Sample answer to be graded."  # placeholder input

# Use an instruction-tuned model and explicit scoring instructions
judge = LLMJudge(
    model_name="gpt-4o-mini",
    prompt_template="Score the input strictly between 0 and 1 with two decimals.",
)
# Round before comparing so floating-point noise does not count as a mismatch
score1 = round(judge.score(input_text), 2)
score2 = round(judge.score(input_text), 2)
if score1 != score2:
    raise LLMJudgeInconsistentScoringError(
        f"Inconsistent scoring detected: expected score {score1} but got {score2}"
    )
The fixed version switches to an instruction-tuned model, adds explicit scoring instructions to the prompt, and rounds scores before comparing them, which makes repeated scores far more likely to match.

Workaround

Catch the inconsistency exception and retry scoring multiple times, then take the median or mode of the scores as a fallback.
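A minimal sketch of the median fallback, assuming a hypothetical `score_fn` callable that wraps the judge call. The function name and the default of five attempts are illustrative, not part of ai_testing.

```python
import statistics
from typing import Callable


def median_score(
    score_fn: Callable[[str], float],  # hypothetical: any callable that returns a judge score
    text: str,
    attempts: int = 5,
) -> float:
    """Score `text` several times and return the median as a fallback value."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    scores = [score_fn(text) for _ in range(attempts)]
    return statistics.median(scores)
```

This would be called from the `except` branch that handles LLMJudgeInconsistentScoringError. An odd number of attempts keeps the median equal to one of the observed scores instead of an average of two.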

Prevention

Design prompts with strict, clear scoring criteria and use instruction-tuned LLMs; implement score normalization to avoid false inconsistencies.
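As an illustration of strict scoring criteria, the sketch below pairs a rubric-style prompt with a parser that rejects any reply that is not a two-decimal score. The rubric wording, the `JUDGE_PROMPT` template, and the helpers `build_prompt` and `parse_score` are all assumptions made up for this example. They are not part of ai_testing, and the rubric would need adapting to the task being judged.

```python
import re

# Hypothetical rubric prompt: fixed anchor scores plus one worked example
JUDGE_PROMPT = """You are grading an answer. Reply with one number between 0.00 and 1.00, with exactly two decimals and nothing else.

Rubric:
- 1.00: fully correct and complete
- 0.50: partially correct or missing key details
- 0.00: incorrect or off-topic

Example
Answer: "Paris is the capital of France."
Score: 1.00

Answer: "{answer}"
Score:"""


def build_prompt(answer: str) -> str:
    """Fill the rubric template with the answer to be graded."""
    return JUDGE_PROMPT.format(answer=answer)


def parse_score(reply: str) -> float:
    """Accept only replies of the form d.dd within [0, 1]; raise ValueError otherwise."""
    match = re.fullmatch(r"\s*([01]\.\d{2})\s*", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    value = float(match.group(1))
    if value > 1.0:
        raise ValueError(f"Score out of range: {value}")
    return value
```

Rejecting free-form replies at parse time keeps formatting drift from being misread as a scoring disagreement.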

Python 3.9+ · ai-testing >=1.0.0 · tested on 1.2.3
Verified 2026-04