How to build a custom LLM judge
Quick answer
Build a custom LLM judge by prompting a large language model like
gpt-4o-mini to evaluate outputs against criteria you define. Use the OpenAI SDK to send evaluation prompts and parse the model's judgment for automated scoring or feedback.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install "openai>=1.0"
Step by step
This example shows how to create a simple LLM judge that evaluates whether a generated answer correctly addresses a question. The judge prompts gpt-4o-mini to score the answer for correctness and provide a brief explanation.
import os
from openai import OpenAI

# Read the API key from the environment rather than hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

question = "What is the capital of France?"
generated_answer = "The capital of France is Paris."

# Build the evaluation prompt the judge model will see
prompt = (
    "You are an expert judge. Evaluate the following answer to the question:\n"
    f"Question: {question}\n"
    f"Answer: {generated_answer}\n\n"
    "Is the answer correct? Reply with 'Correct' or 'Incorrect' and a short explanation."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

judgment = response.choices[0].message.content
print("Judge's evaluation:", judgment)
Output
Judge's evaluation: Correct. The answer correctly identifies Paris as the capital of France.
Common variations
- Use different models, such as claude-3-5-sonnet-20241022, for potentially more nuanced judgments.
- Implement asynchronous calls with asyncio for batch evaluations.
- Design multi-criteria prompts that judge relevance, completeness, and style.
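The asyncio variation can be sketched as follows. This is an illustrative outline, not SDK code: judge_one is a stub standing in for a real judge call (in practice you would await AsyncOpenAI().chat.completions.create(...) inside it), so the concurrency pattern itself runs without an API key.

```python
import asyncio

async def judge_one(question: str, answer: str) -> str:
    # Stub standing in for a real judge call; with the SDK you would
    # `await AsyncOpenAI().chat.completions.create(...)` here instead.
    await asyncio.sleep(0)  # simulate waiting on the network
    return f"Correct. Stub verdict for: {question}"

async def judge_batch(pairs):
    # gather() runs all judge calls concurrently rather than one at a time
    return await asyncio.gather(*(judge_one(q, a) for q, a in pairs))

results = asyncio.run(judge_batch([
    ("What is the capital of France?", "Paris."),
    ("What is 2 + 2?", "4"),
]))
print(results)
```

Because the calls are awaited concurrently, a batch of N evaluations takes roughly as long as the slowest single call rather than the sum of all of them.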
Troubleshooting
- If the judge returns vague answers, refine the prompt to be more explicit about the expected format.
- For inconsistent scoring, add examples in the prompt to guide the model.
- Ensure your API key is valid and environment variable is set to avoid authentication errors.
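The first two fixes above can be combined: constrain the reply format in the prompt and include one worked example to anchor the scoring. The template and parser below are a hypothetical sketch (JUDGE_TEMPLATE and parse_verdict are illustrative names, not part of any SDK):

```python
import re
from typing import Optional

# Hypothetical template: pins down the reply format and includes one
# worked example (few-shot) to stabilize the judge's verdicts.
JUDGE_TEMPLATE = """You are an expert judge. Reply on one line, exactly as:
VERDICT: Correct|Incorrect | REASON: <one sentence>

Example:
Question: What is 2 + 2?
Answer: 5
VERDICT: Incorrect | REASON: 2 + 2 equals 4, not 5.

Question: {question}
Answer: {answer}"""

def parse_verdict(judgment: str) -> Optional[bool]:
    """Return True/False for the verdict, or None if the format was not followed."""
    match = re.search(r"VERDICT:\s*(Correct|Incorrect)", judgment, re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "correct"
```

Send JUDGE_TEMPLATE.format(question=..., answer=...) as the user message; a None from parse_verdict flags replies that ignored the format and should be retried or inspected.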
Key Takeaways
- Use explicit, clear prompts to guide the LLM judge's evaluation criteria.
- Leverage the OpenAI SDK with gpt-4o-mini for reliable judgment generation.
- Customize the judge by adding examples or multiple evaluation dimensions in prompts.
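The multi-dimension idea in the takeaways can be sketched as a prompt builder plus a score parser. The names (build_multi_criteria_prompt, parse_scores) and the 1-5 scale are illustrative choices, not part of the SDK:

```python
import re
from typing import Dict

CRITERIA = ["relevance", "completeness", "style"]

def build_multi_criteria_prompt(question: str, answer: str) -> str:
    """Ask the judge to score each criterion from 1 to 5, one per line."""
    lines = "\n".join(f"{c}: <score 1-5>" for c in CRITERIA)
    return (
        "You are an expert judge. Score the answer on each criterion below "
        "from 1 (poor) to 5 (excellent), one per line, exactly as:\n"
        f"{lines}\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def parse_scores(judgment: str) -> Dict[str, int]:
    """Pull '<criterion>: <digit>' lines out of the judge's reply."""
    scores = {}
    for c in CRITERIA:
        m = re.search(rf"{c}\s*:\s*([1-5])", judgment, re.IGNORECASE)
        if m:
            scores[c] = int(m.group(1))
    return scores
```

Missing keys in the returned dict indicate criteria the judge failed to score, which you can treat the same way as any other malformed reply.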