How to score LLM outputs with rubrics
Quick answer
To score LLM outputs with rubrics, define clear evaluation criteria and use a scoring rubric to assign numeric or categorical scores to each output. Automate this by prompting an LLM (e.g., gpt-4o) with the rubric and output, then parse the model's scored response for consistent evaluation.
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
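For example (the key value below is a placeholder, not a real credential):

```shell
# Export the API key for the current shell session; substitute your real key
export OPENAI_API_KEY="your-key-here"
```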
pip install openai>=1.0

Step by step
Define a rubric with criteria such as relevance, coherence, and factual accuracy. Then, prompt the LLM to score an output against this rubric. Parse the response to extract scores.
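The rubric can also be kept as data and rendered into the prompt string, which makes criteria easy to add or reorder; a minimal sketch (CRITERIA and render_rubric are illustrative names, not part of any library):

```python
CRITERIA = {
    "relevance": "How well does the output answer the question?",
    "coherence": "Is the output logically consistent and clear?",
    "accuracy": "Are the facts correct?",
}

def render_rubric(criteria: dict, low: int = 1, high: int = 5) -> str:
    """Turn a criteria dict into the rubric text sent to the model."""
    lines = [f"Score the following output on a scale of {low} to {high} for each criterion:"]
    for i, (name, question) in enumerate(criteria.items(), start=1):
        lines.append(f"{i}. {name.capitalize()}: {question}")
    keys = ", ".join(f"'{k}'" for k in criteria)
    lines.append(f"Provide your answer as a JSON object with keys {keys}.")
    return "\n".join(lines)

print(render_rubric(CRITERIA))
```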
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Define rubric and output to score
rubric = (
    "Score the following output on a scale of 1 to 5 for each criterion:\n"
    "1. Relevance: How well does the output answer the question?\n"
    "2. Coherence: Is the output logically consistent and clear?\n"
    "3. Factual accuracy: Are the facts correct?\n"
    "Provide your answer as a JSON object with keys 'relevance', 'coherence', 'accuracy'."
)
output_to_score = "The capital of France is Berlin."
prompt = f"{rubric}\nOutput: {output_to_score}\nScores:"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
scores_text = response.choices[0].message.content
print("Scoring result:\n", scores_text)

Output
Scoring result:
{"relevance": 4, "coherence": 5, "accuracy": 1}

Common variations
You can use asynchronous calls with asyncio and the AsyncOpenAI client for batch scoring, or switch to other models such as claude-3-5-sonnet-20241022 for different evaluation styles. Streaming responses are less common for rubric scoring but possible.
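When batch-scoring many outputs, you may also want to cap concurrency so you stay within rate limits; a sketch using asyncio.Semaphore with a stubbed scorer (score_stub stands in for a real API call and returns canned JSON):

```python
import asyncio
import json

async def score_stub(output: str) -> str:
    # Stand-in for a real API call; returns canned JSON after a short delay.
    await asyncio.sleep(0.01)
    return '{"relevance": 5, "coherence": 5, "accuracy": 5}'

async def score_all(outputs, max_concurrency: int = 5):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(output):
        async with sem:  # at most max_concurrency calls in flight at once
            return await score_stub(output)

    return await asyncio.gather(*(bounded(o) for o in outputs))

results = asyncio.run(score_all(["out A", "out B", "out C"]))
print([json.loads(r)["accuracy"] for r in results])  # -> [5, 5, 5]
```

Swapping score_stub for the score_output_async function below gives rate-limited batch scoring against the real API.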
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def score_output_async(output: str):
    rubric = (
        "Score the output on relevance, coherence, and accuracy (1-5).\n"
        "Return JSON with keys 'relevance', 'coherence', 'accuracy'."
    )
    prompt = f"{rubric}\nOutput: {output}\nScores:"
    # The async client exposes the same create() method, awaited.
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
async def main():
    outputs = [
        "Paris is the capital of France.",
        "The sun revolves around the Earth."
    ]
    tasks = [score_output_async(o) for o in outputs]
    results = await asyncio.gather(*tasks)
    for i, res in enumerate(results):
        print(f"Output {i+1} scores:\n{res}\n")
# Run async example
# asyncio.run(main())

Troubleshooting
- If the model returns unstructured text instead of JSON, explicitly instruct it to respond only with JSON, or enable the API's JSON mode by passing response_format={"type": "json_object"} to create().
- If scores are missing or inconsistent, increase prompt clarity or use few-shot examples.
- For API errors, verify your API key and model availability.
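The JSON-parsing advice above can be made concrete with a defensive parser; a sketch (parse_scores is an illustrative helper, and the fence-stripping branch covers models that wrap JSON in a markdown code fence):

```python
import json

def parse_scores(scores_text: str) -> dict:
    """Extract the score dict from a model reply that should contain JSON."""
    cleaned = scores_text.strip()
    if cleaned.startswith("```"):
        # Models sometimes wrap JSON in a markdown fence; strip it off,
        # along with an optional "json" language tag.
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    return json.loads(cleaned)

print(parse_scores('{"relevance": 4, "coherence": 5, "accuracy": 1}'))
```

If json.loads still raises, log the raw reply and retry with a stricter prompt rather than guessing at the scores.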
Key Takeaways
- Use explicit, structured prompts to get consistent rubric scores from LLMs.
- Automate rubric scoring by parsing JSON responses from models like gpt-4o.
- Async calls enable efficient batch scoring of multiple outputs.
- Clear rubric criteria improve evaluation objectivity and reproducibility.
- Troubleshoot by refining prompts and verifying API credentials.