LLMJudgeInconsistentScoringError
ai_testing.errors.LLMJudgeInconsistentScoringError
Stack trace
ai_testing.errors.LLMJudgeInconsistentScoringError: Inconsistent scoring detected: expected score 0.8 but got 0.5 for input ID 12345
  File "/app/ai_testing/judge.py", line 87, in evaluate
    raise LLMJudgeInconsistentScoringError(f"Inconsistent scoring detected: expected score {expected} but got {actual} for input ID {input_id}")
Why it happens
This error occurs when the LLM judge produces different scores for the same input across multiple evaluations. It usually happens because the prompt lacks strict instructions to produce deterministic scoring, or the model used is not instruction-tuned to follow scoring guidelines consistently.
Detection
Monitor scoring outputs for repeated inputs and assert score consistency; log discrepancies and raise alerts when scores differ beyond a threshold.
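The check described above can be sketched roughly as follows. This is a minimal illustration, not part of ai_testing: `ConsistencyMonitor`, `record`, and the default `threshold` are hypothetical names chosen for the example, and scores are assumed to be floats in the range 0 to 1.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("judge_consistency")


class ConsistencyMonitor:
    """Tracks scores per input ID and flags spreads above a threshold.

    Hypothetical helper for illustration; not part of ai_testing.
    """

    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self._scores = defaultdict(list)  # input_id -> list of observed scores

    def record(self, input_id, score):
        """Store a score and return True while the input stays consistent."""
        scores = self._scores[input_id]
        scores.append(score)
        spread = max(scores) - min(scores)
        if spread > self.threshold:
            # Log the discrepancy; an alerting hook could be called here instead.
            logger.warning(
                "Inconsistent scores for input %s: spread %.3f exceeds %.3f",
                input_id, spread, self.threshold,
            )
            return False
        return True


monitor = ConsistencyMonitor(threshold=0.05)
first_ok = monitor.record("12345", 0.8)   # first observation, nothing to compare
second_ok = monitor.record("12345", 0.5)  # spread of about 0.3 is flagged
```

Comparing the spread (max minus min) rather than consecutive pairs means slow drift across many evaluations is also caught.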
Causes & fixes
Cause: Prompt instructions are ambiguous or incomplete, so the LLM interprets the scoring criteria differently on each run.
Fix: Revise the prompt to include explicit, unambiguous scoring criteria and examples that guide the LLM judge towards consistent scoring.
Cause: The judge runs on a base model that is not instruction-tuned, which leads to variable scoring output.
Fix: Switch to an instruction-tuned model such as gpt-4o-mini or claude-3-5-haiku-20241022 that reliably follows scoring instructions.
Cause: The scoring function does not normalize or round scores, so minor floating-point differences trigger inconsistency errors.
Fix: Normalize or round scores before comparison to tolerate minor floating-point variations.
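One way to approach the normalization fix is sketched below. The helper names `normalize_score` and `scores_match`, and the default tolerance of 0.01, are assumptions made for this example rather than anything provided by ai_testing.

```python
import math


def normalize_score(score, decimals=2):
    """Clamp a raw judge score to [0, 1] and round it to a fixed precision."""
    clamped = min(1.0, max(0.0, score))
    return round(clamped, decimals)


def scores_match(a, b, tolerance=0.01):
    """Treat two scores as equal when they differ by at most `tolerance`."""
    # rel_tol=0.0 makes this a pure absolute-difference check.
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)


# Floating-point noise no longer counts as an inconsistency.
noise_ok = scores_match(0.8, 0.8000000001)
# A real disagreement is still detected.
real_gap = scores_match(0.8, 0.5)
```

Rounding alone can still split two nearly identical values that straddle a rounding boundary (0.804999 and 0.805001 round to 0.8 and 0.81), which is why the sketch compares with an absolute tolerance instead of testing rounded values for equality.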
Code: broken vs fixed
# Broken: base model, no scoring instructions, exact float comparison
from ai_testing import LLMJudge

input_text = "Response to evaluate"  # placeholder input

judge = LLMJudge(model_name="base-llm")
score1 = judge.score(input_text)
score2 = judge.score(input_text)
if score1 != score2:
    # Same condition that makes the library raise LLMJudgeInconsistentScoringError
    raise Exception("Inconsistent scoring detected")

# Fixed: instruction-tuned model, explicit scoring instructions, rounded scores
import os
from ai_testing import LLMJudge

os.environ["AI_TESTING_API_KEY"] = os.environ["OPENAI_API_KEY"]

input_text = "Response to evaluate"  # placeholder input

judge = LLMJudge(model_name="gpt-4o-mini", prompt_template="Score the input strictly between 0 and 1 with two decimals.")
# Round before comparing so floating-point noise is not reported as a mismatch
score1 = round(judge.score(input_text), 2)
score2 = round(judge.score(input_text), 2)
if score1 != score2:
    # Much less likely now; a failure here points to a real scoring disagreement
    raise Exception("Inconsistent scoring detected")
Workaround
Catch the inconsistency exception and retry scoring multiple times, then take the median or mode of the scores as a fallback.
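A rough sketch of this fallback is shown below. To keep it runnable without the library, `InconsistentScoringError` is a local stand-in for ai_testing.errors.LLMJudgeInconsistentScoringError, and `evaluate` and `score` are passed in as plain callables; their exact signatures in ai_testing are an assumption here.

```python
import statistics


class InconsistentScoringError(Exception):
    """Local stand-in for ai_testing.errors.LLMJudgeInconsistentScoringError."""


def score_with_fallback(evaluate, score, text, attempts=5):
    """Try the normal evaluation; on inconsistency, sample and take the median."""
    try:
        return evaluate(text)
    except InconsistentScoringError:
        # Re-score several times and aggregate instead of failing outright.
        samples = [score(text) for _ in range(attempts)]
        return statistics.median(samples)


# Demo with fake callables so the sketch runs without a model.
_fake_scores = iter([0.8, 0.5, 0.8, 0.7, 0.8])


def fake_evaluate(text):
    raise InconsistentScoringError("expected score 0.8 but got 0.5")


def fake_score(text):
    return next(_fake_scores)


result = score_with_fallback(fake_evaluate, fake_score, "some input")
```

The median is robust to a single outlier score. If the judge is restricted to a small set of discrete scores, `statistics.mode` could be swapped in to pick the most frequent value instead.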
Prevention
Design prompts with strict, clear scoring criteria and use instruction-tuned LLMs; implement score normalization to avoid false inconsistencies.
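As an illustration of strict scoring criteria, the sketch below assembles a judge prompt from a fixed rubric. The rubric wording, the three allowed score values, and the `build_judge_prompt` helper are all hypothetical; restricting the judge to a few anchored values is one possible design choice, not something ai_testing prescribes.

```python
# Hypothetical rubric: a small set of allowed scores, each with a concrete anchor.
RUBRIC = [
    (1.0, "Fully correct and complete."),
    (0.5, "Partially correct, or correct but missing key details."),
    (0.0, "Incorrect or off-topic."),
]


def build_judge_prompt(candidate):
    """Assemble a scoring prompt that leaves little room for interpretation."""
    lines = [
        "You are a strict grader. Score the response using ONLY the values below.",
        "",
    ]
    for score, description in RUBRIC:
        lines.append(f"{score:.2f}: {description}")
    lines += [
        "",
        "Reply with the score alone, formatted with two decimals (for example 0.50).",
        "",
        "Response to score:",
        candidate,
    ]
    return "\n".join(lines)


prompt = build_judge_prompt("Paris is the capital of France.")
```

With only a few permitted values, two runs either agree exactly or disagree by a visible margin, which makes consistency checks easier to reason about than comparisons on a continuous scale.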