How to build a custom LLM judge
Quick answer
Build a custom LLM judge by prompting a large language model like
gpt-4o-mini to evaluate outputs against criteria you define. Use the OpenAI SDK to send evaluation prompts and parse the model's judgment for automated scoring or feedback.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install "openai>=1.0"
Step by step
This example shows how to create a simple LLM judge that evaluates whether a generated answer correctly addresses a question. The judge prompts gpt-4o-mini to score the answer for correctness and provide a brief explanation.
import os
from openai import OpenAI

# Read the API key from the environment rather than hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

question = "What is the capital of France?"
generated_answer = "The capital of France is Paris."

# Build the evaluation prompt the judge model will see
prompt = (
    "You are an expert judge. Evaluate the following answer to the question:\n"
    f"Question: {question}\n"
    f"Answer: {generated_answer}\n\n"
    "Is the answer correct? Reply with 'Correct' or 'Incorrect' and a short explanation."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

judgment = response.choices[0].message.content
print("Judge's evaluation:", judgment)
Output
Judge's evaluation: Correct. The answer correctly identifies Paris as the capital of France.
Common variations
- Use different models, such as claude-3-5-sonnet-20241022, for potentially more nuanced judgments.
- Implement asynchronous calls with asyncio for batch evaluations.
- Design multi-criteria prompts that judge relevance, completeness, and style.
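The asyncio variation can be sketched as follows. This is an illustrative outline, not SDK code: judge_one is a stub standing in for a real judge call (in practice you would await AsyncOpenAI().chat.completions.create(...) inside it), so the concurrency pattern itself runs without an API key.

```python
import asyncio

async def judge_one(question: str, answer: str) -> str:
    # Stub standing in for a real judge call; with the SDK you would
    # `await AsyncOpenAI().chat.completions.create(...)` here instead.
    await asyncio.sleep(0)  # simulate waiting on the network
    return f"Correct. Stub verdict for: {question}"

async def judge_batch(pairs):
    # gather() runs all judge calls concurrently rather than one at a time
    return await asyncio.gather(*(judge_one(q, a) for q, a in pairs))

results = asyncio.run(judge_batch([
    ("What is the capital of France?", "Paris."),
    ("What is 2 + 2?", "4"),
]))
print(results)
```

Because the calls are awaited concurrently, a batch of N evaluations takes roughly as long as the slowest single call rather than the sum of all of them.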
Troubleshooting
- If the judge returns vague answers, refine the prompt to be more explicit about the expected format.
- For inconsistent scoring, add examples in the prompt to guide the model.
- Ensure your API key is valid and environment variable is set to avoid authentication errors.
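The first two fixes above can be combined: constrain the reply format in the prompt and include one worked example to anchor the scoring. The template and parser below are a hypothetical sketch (JUDGE_TEMPLATE and parse_verdict are illustrative names, not part of any SDK):

```python
import re
from typing import Optional

# Hypothetical template: pins down the reply format and includes one
# worked example (few-shot) to stabilize the judge's verdicts.
JUDGE_TEMPLATE = """You are an expert judge. Reply on one line, exactly as:
VERDICT: Correct|Incorrect | REASON: <one sentence>

Example:
Question: What is 2 + 2?
Answer: 5
VERDICT: Incorrect | REASON: 2 + 2 equals 4, not 5.

Question: {question}
Answer: {answer}"""

def parse_verdict(judgment: str) -> Optional[bool]:
    """Return True/False for the verdict, or None if the format was not followed."""
    match = re.search(r"VERDICT:\s*(Correct|Incorrect)", judgment, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "correct"
```

Send JUDGE_TEMPLATE.format(question=..., answer=...) as the user message; a None from parse_verdict flags replies that ignored the format and should be retried or inspected.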
Key Takeaways
- Use explicit, clear prompts to guide the LLM judge's evaluation criteria.
- Leverage the OpenAI SDK with gpt-4o-mini for reliable judgment generation.
- Customize the judge by adding examples or multiple evaluation dimensions in prompts.
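The multi-dimension idea in the takeaways can be sketched as a prompt builder plus a score parser. The names (build_multi_criteria_prompt, parse_scores) and the 1-5 scale are illustrative choices, not part of the SDK:

```python
import re
from typing import Dict

CRITERIA = ["relevance", "completeness", "style"]

def build_multi_criteria_prompt(question: str, answer: str) -> str:
    """Ask the judge to score each criterion from 1 to 5, one per line."""
    lines = "\n".join(f"{c}: <score 1-5>" for c in CRITERIA)
    return (
        "You are an expert judge. Score the answer on each criterion below "
        "from 1 (poor) to 5 (excellent), one per line, exactly as:\n"
        f"{lines}\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def parse_scores(judgment: str) -> Dict[str, int]:
    """Pull '<criterion>: <digit>' lines out of the judge's reply."""
    scores = {}
    for c in CRITERIA:
        m = re.search(rf"{c}\s*:\s*([1-5])", judgment, re.IGNORECASE)
        if m:
            scores[c] = int(m.group(1))
    return scores
```

Missing keys in the returned dict indicate criteria the judge failed to score, which you can treat the same way as any other malformed reply.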