How-to · Intermediate · 3 min read

How to score LLM outputs with rubrics

Quick answer
To score LLM outputs with rubrics, define clear evaluation criteria and use a scoring rubric to assign numeric or categorical scores to each output. Automate this by prompting an LLM (e.g., gpt-4o) with the rubric and output, then parse the model's scored response for consistent evaluation.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"

Step by step

Define a rubric with criteria such as relevance, coherence, and factual accuracy. Then, prompt the LLM to score an output against this rubric. Parse the response to extract scores.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define rubric and output to score
rubric = (
    "Score the following output on a scale of 1 to 5 for each criterion:\n"
    "1. Relevance: How well does the output answer the question?\n"
    "2. Coherence: Is the output logically consistent and clear?\n"
    "3. Factual accuracy: Are the facts correct?\n"
    "Provide your answer as a JSON object with keys 'relevance', 'coherence', 'accuracy'."
)

output_to_score = "The capital of France is Berlin."

prompt = f"{rubric}\nOutput: {output_to_score}\nScores:" 

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

scores_text = response.choices[0].message.content
print("Scoring result:\n", scores_text)

output
Scoring result:
 {"relevance": 4, "coherence": 5, "accuracy": 1}
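
The step above prints the raw response text; the parsing mentioned earlier can be sketched as follows. The regex fallback is a defensive assumption for responses where the model wraps the JSON in extra prose:

python
import json
import re

def parse_scores(scores_text: str) -> dict:
    """Extract the rubric scores from the model's response text."""
    try:
        return json.loads(scores_text)
    except json.JSONDecodeError:
        # Fall back to the first {...} block if the model added prose
        match = re.search(r"\{.*?\}", scores_text, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise

scores = parse_scores('{"relevance": 4, "coherence": 5, "accuracy": 1}')
print(scores["accuracy"])  # 1

Parsing into a dict lets you threshold or aggregate scores downstream instead of eyeballing raw text.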

Common variations

You can use asynchronous calls with asyncio for batch scoring or switch to other models like claude-3-5-sonnet-20241022 for different evaluation styles. Streaming responses are less common for rubric scoring but possible.

python
import asyncio
import os
from openai import AsyncOpenAI

# Async calls require the AsyncOpenAI client (openai>=1.0)
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def score_output_async(output: str):
    rubric = (
        "Score the output on relevance, coherence, and accuracy (1-5).\n"
        "Return JSON with keys 'relevance', 'coherence', 'accuracy'."
    )
    prompt = f"{rubric}\nOutput: {output}\nScores:"

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    outputs = [
        "Paris is the capital of France.",
        "The sun revolves around the Earth."
    ]
    tasks = [score_output_async(o) for o in outputs]
    results = await asyncio.gather(*tasks)
    for i, res in enumerate(results):
        print(f"Output {i+1} scores:\n{res}\n")

# Run async example
# asyncio.run(main())

Troubleshooting

  • If the model returns unstructured text instead of JSON, instruct it to respond only with JSON, or enable JSON mode by passing response_format={"type": "json_object"} to chat.completions.create.
  • If scores are missing or inconsistent, increase prompt clarity or use few-shot examples.
  • For API errors, verify your API key and model availability.
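
The few-shot suggestion above can be sketched as a prompt prefix. The example output and scores here are illustrative, and build_prompt is a hypothetical helper, not part of any SDK:

python
# Worked example prepended so the model mirrors its format
FEW_SHOT_EXAMPLE = (
    "Example:\n"
    'Output: "The capital of France is Paris."\n'
    'Scores: {"relevance": 5, "coherence": 5, "accuracy": 5}\n\n'
)

def build_prompt(rubric: str, output: str) -> str:
    """Assemble a rubric-scoring prompt with one few-shot example."""
    return f"{rubric}\n{FEW_SHOT_EXAMPLE}Output: {output}\nScores:"

prompt = build_prompt(
    "Score on relevance, coherence, accuracy (1-5). Return JSON.",
    "The sun revolves around the Earth.",
)

A single well-formed example usually does more for output consistency than lengthening the rubric itself.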

Key Takeaways

  • Use explicit, structured prompts to get consistent rubric scores from LLMs.
  • Automate rubric scoring by parsing JSON responses from models like gpt-4o.
  • Async calls enable efficient batch scoring of multiple outputs.
  • Clear rubric criteria improve evaluation objectivity and reproducibility.
  • Troubleshoot by refining prompts and verifying API credentials.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022