How to · Intermediate · 3 min read

LLM-as-judge bias and limitations

Quick answer
Using a large language model (LLM) as a judge introduces bias inherited from its training data, including well-documented tendencies such as position bias (favoring the first answer shown) and verbosity bias (favoring longer answers). Other limitations include sensitivity to prompt phrasing and difficulty handling nuanced or subjective judgments. Always combine LLM judgments with human oversight and diverse evaluation methods.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell does not treat > as redirection)

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

This example shows how to use an LLM (gpt-4o) as a judge to evaluate two text completions and highlights potential bias and limitations in the output.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Two sample completions to judge
completion_a = "The capital of France is Paris."
completion_b = "Paris is the capital city of France."

# Prompt the LLM to judge which completion is better
prompt = (
    "You are a fair judge. Compare these two answers for correctness and clarity.\n"
    f"Answer A: {completion_a}\n"
    f"Answer B: {completion_b}\n"
    "Which answer is better and why?"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print("Judge's evaluation:")
print(response.choices[0].message.content)
output
Judge's evaluation:
Answer B is better because it is more complete and clear, explicitly stating that Paris is the capital city of France, which improves clarity and precision.
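A free-text verdict like the one above is hard to score programmatically. A common workaround is to instruct the judge to name the winner explicitly ("Answer A" or "Answer B") and extract the label afterward. A minimal sketch, assuming the judge's reply follows that convention (`extract_verdict` is a hypothetical helper, not part of any SDK):

```python
import re

def extract_verdict(judge_text):
    """Pull the winning label ("A" or "B") out of a judge's free-text
    evaluation; returns None when no explicit label is found."""
    match = re.search(r"\bAnswer\s+([AB])\b", judge_text)
    return match.group(1) if match else None

print(extract_verdict("Answer B is better because it is more complete."))  # B
print(extract_verdict("Both are fine."))                                   # None
```

Returning None rather than guessing lets downstream code flag unparseable verdicts for human review instead of silently miscounting them.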

Common variations

You can use other models, such as claude-3-5-sonnet-20241022 or gemini-2.5-pro, as judges. Async calls and streaming responses are also possible, depending on the SDK. Careful prompt engineering matters: explicitly instructing the judge to be fair and neutral helps reduce, though not eliminate, bias.

python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

completion_a = "The capital of France is Paris."
completion_b = "Paris is the capital city of France."

system_prompt = "You are a fair and unbiased judge. Evaluate the two answers for accuracy and clarity."
user_prompt = f"Answer A: {completion_a}\nAnswer B: {completion_b}\nWhich is better and why?"

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}]
)

print("Claude judge's evaluation:")
print(message.content[0].text)
output
Claude judge's evaluation:
Answer B is better because it provides a clearer and more precise statement that Paris is the capital city of France, enhancing clarity without losing accuracy.
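Both examples above present Answer A first, and judges often favor whichever answer appears first (position bias). A simple mitigation is to run the judge on both orderings and keep a verdict only when it survives the swap. A minimal sketch, where `judge` is a hypothetical callable wrapping one of the API calls above and returning "A", "B", or "tie":

```python
def debiased_judgment(judge, answer_a, answer_b):
    """Run the judge on both orderings of the two answers; a verdict
    counts only if it survives swapping their positions."""
    first = judge(answer_a, answer_b)    # original order
    second = judge(answer_b, answer_a)   # positions swapped
    # Map the swapped-run verdict back to the original labels
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    # Disagreement between the two runs signals position bias -> tie
    return first if first == swapped else "tie"

# Demo with a stub judge that always prefers whichever answer comes first
position_biased = lambda a, b: "A"
print(debiased_judgment(position_biased, "ans1", "ans2"))  # tie
```

This doubles the API cost per comparison, but it converts silent position bias into an explicit "tie" that you can route to human review.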

Troubleshooting

If the LLM judge's output seems biased or inconsistent, rephrase the prompt to emphasize neutrality and fairness. Also test multiple prompt phrasings and aggregate the results to reduce single-prompt bias. If the model hallucinates or misinterprets the answers, fall back to human review.
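The "aggregate results" advice above can be sketched as a majority vote over verdicts collected from several prompt phrasings. A minimal sketch (the threshold of 0.6 is an illustrative assumption, not a standard value):

```python
from collections import Counter

def aggregate_verdicts(verdicts, min_agreement=0.6):
    """Majority-vote over verdicts ("A", "B", or "tie") gathered from
    several prompt phrasings; return None when no verdict reaches the
    agreement threshold, so the case can be escalated to a human."""
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count / len(verdicts) >= min_agreement else None

print(aggregate_verdicts(["B", "B", "A", "B"]))  # B (75% agreement)
print(aggregate_verdicts(["A", "B", "tie"]))     # None (no majority)
```

Cases that come back as None are exactly the ones worth sending to human reviewers.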

Key Takeaways

  • LLMs reflect biases present in their training data, affecting judgment fairness.
  • Prompt design is crucial to minimize bias and clarify evaluation criteria.
  • LLM judgments should be supplemented with human oversight for critical decisions.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022, gemini-2.5-pro