
How to design evaluation criteria for an LLM judge

Quick answer
Design evaluation criteria for an LLM judge by defining clear, objective metrics such as accuracy, relevance, coherence, and fairness. Use a combination of quantitative scores and qualitative guidelines to ensure consistent and unbiased assessment of AI-generated outputs.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell doesn't treat > as a redirect)

Define clear evaluation metrics

Start by selecting measurable criteria that reflect the quality of LLM outputs. Common metrics include:

  • Accuracy: Correctness of the information provided.
  • Relevance: How well the response addresses the prompt.
  • Coherence: Logical flow and clarity of the response.
  • Fairness and Bias: Absence of harmful or biased content.
  • Creativity: Novelty and insightfulness when applicable.

Each metric should have a clear definition and scoring rubric to minimize subjectivity.

  Metric              Description
  Accuracy            Correctness of factual information
  Relevance           Alignment with the prompt or question
  Coherence           Logical and clear structure
  Fairness and Bias   Neutrality and ethical considerations
  Creativity          Originality and insightfulness
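
A rubric is easiest to keep consistent when it lives in code and is rendered into the judge prompt, so every evaluation run sees the same definitions. The sketch below is one possible shape (the RUBRIC structure and rubric_text helper are illustrative, not from any library), with per-score anchors that pin down what a 1, 3, or 5 means:

```python
# Hypothetical rubric: each criterion gets a definition plus score
# anchors, so judges apply the 1-5 scale the same way every time.
RUBRIC = {
    "accuracy": {
        "definition": "Correctness of factual information",
        "anchors": {1: "Mostly incorrect", 3: "Partially correct", 5: "Fully correct"},
    },
    "relevance": {
        "definition": "Alignment with the prompt or question",
        "anchors": {1: "Off-topic", 3: "Partially addresses the prompt", 5: "Fully on-topic"},
    },
    "coherence": {
        "definition": "Logical and clear structure",
        "anchors": {1: "Disjointed", 3: "Mostly clear", 5: "Clear and well organized"},
    },
}

def rubric_text(rubric):
    """Render the rubric as plain text to paste into a judge prompt."""
    lines = []
    for name, spec in rubric.items():
        lines.append(f"{name.title()}: {spec['definition']}")
        for score, desc in sorted(spec["anchors"].items()):
            lines.append(f"  {score} = {desc}")
    return "\n".join(lines)

print(rubric_text(RUBRIC))
```

Embedding the rendered rubric in the prompt (rather than restating it by hand) keeps the scoring instructions identical across runs and models.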

Implement evaluation with an LLM judge

Use an LLM as a judge by prompting it to score or rank outputs based on the defined criteria. Provide explicit instructions and examples to guide its evaluation.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = '''You are an LLM judge. Evaluate the following response based on Accuracy, Relevance, and Coherence on a scale from 1 to 5.

Prompt: What is the capital of France?
Response: Paris is the capital city of France.

Provide your scores as JSON with keys "accuracy", "relevance", "coherence".'''

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # low temperature keeps judge scores more consistent across runs
)

print(response.choices[0].message.content)
output
{
  "accuracy": 5,
  "relevance": 5,
  "coherence": 5
}
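
In practice the judge's reply is just text, and models sometimes wrap the JSON in code fences or add a sentence of commentary. A small parsing helper makes the pipeline tolerant of that; parse_scores below is an illustrative sketch (not part of the OpenAI SDK) that extracts the first {...} block before calling json.loads:

```python
import json
import re

def parse_scores(reply):
    """Extract the first JSON object from a judge's reply.

    Models may wrap JSON in ```json fences or surround it with prose,
    so search for a {...} span instead of json.loads on the raw string.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in reply: {reply!r}")
    return json.loads(match.group(0))

reply = 'Here are my scores:\n```json\n{"accuracy": 5, "relevance": 5, "coherence": 4}\n```'
scores = parse_scores(reply)
print(scores)  # {'accuracy': 5, 'relevance': 5, 'coherence': 4}
```

If parsing fails repeatedly, tighten the prompt ("Respond with only the JSON object") or use the SDK's structured-output options where available.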

Common variations and best practices

Consider these variations to improve evaluation robustness:

  • Multi-metric aggregation: Combine scores into a weighted overall rating.
  • Use multiple judges: Aggregate evaluations from several LLM judges to reduce bias.
  • Human-in-the-loop: Incorporate human review for edge cases or ambiguous outputs.
  • Automate with SDKs: Use OpenAI or Anthropic SDKs for scalable evaluation pipelines.
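
The first two variations above can be sketched in a few lines. The function names (aggregate_judges, weighted_overall) and the example weights are illustrative assumptions, not a standard API:

```python
from statistics import mean

def aggregate_judges(per_judge):
    """Average each metric across several judges' score dicts
    to dampen any single judge's bias."""
    metrics = per_judge[0].keys()
    return {m: mean(j[m] for j in per_judge) for m in metrics}

def weighted_overall(scores, weights):
    """Combine per-metric scores into one weighted overall rating."""
    total = sum(weights.values())
    return sum(scores[m] * w for m, w in weights.items()) / total

judges = [
    {"accuracy": 5, "relevance": 4, "coherence": 5},
    {"accuracy": 4, "relevance": 5, "coherence": 5},
]
combined = aggregate_judges(judges)
overall = weighted_overall(combined, {"accuracy": 0.5, "relevance": 0.3, "coherence": 0.2})
print(combined, round(overall, 2))  # {'accuracy': 4.5, 'relevance': 4.5, 'coherence': 5.0} 4.6
```

The weights encode which criteria matter most for your use case; dividing by their sum means they need not add up to 1.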

Troubleshooting evaluation issues

If the LLM judge produces inconsistent or biased scores, try:

  • Refining prompt instructions with clearer examples.
  • Increasing the number of evaluation samples for statistical reliability.
  • Using a stronger or more specialized model like claude-3-5-sonnet-20241022 for nuanced judgment.
  • Validating scores against human annotations to calibrate the judge.
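
For the last point, calibration can start very simply: score a sample of items with both the judge and human annotators, then compare. The helper below is a minimal sketch (its name and return keys are made up for this example) that reports mean absolute error and exact-agreement rate:

```python
from statistics import mean

def calibration_report(judge_scores, human_scores):
    """Compare judge scores to human annotations on the same items."""
    assert len(judge_scores) == len(human_scores)
    pairs = list(zip(judge_scores, human_scores))
    mae = mean(abs(j - h) for j, h in pairs)          # average size of disagreement
    agree = mean(1 if j == h else 0 for j, h in pairs)  # fraction of exact matches
    return {"mean_abs_error": mae, "exact_agreement": agree}

report = calibration_report([5, 4, 3, 5], [5, 3, 3, 4])
print(report)  # {'mean_abs_error': 0.5, 'exact_agreement': 0.5}
```

If agreement is low, revisit the rubric and prompt before swapping models; vague criteria produce inconsistent scores from humans and LLMs alike.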

Key Takeaways

  • Define explicit, measurable evaluation metrics to guide the LLM judge.
  • Use clear prompt instructions and examples to ensure consistent scoring.
  • Aggregate multiple evaluations to reduce bias and improve reliability.
  • Incorporate human review for complex or ambiguous cases.
  • Refine prompts and model choice to troubleshoot inconsistent judgments.
Verified 2026-04 · gpt-4o-mini, claude-3-5-sonnet-20241022