How to use Claude to evaluate LLM outputs
Quick answer
Use Claude models via the Anthropic API to evaluate LLM outputs by prompting Claude to score or critique the generated text. Send the original prompt, the LLM output, and an evaluation instruction in the messages parameter to get a detailed assessment or score.
Prerequisites
- Python 3.8+
- Anthropic API key
- pip install anthropic>=0.20
Setup
Install the anthropic Python SDK and set your API key as an environment variable.
- Run pip install anthropic to install the SDK.
- Set your API key in your shell: export ANTHROPIC_API_KEY='your_api_key_here'.
Step by step
Use the claude-3-5-sonnet-20241022 model to evaluate an LLM output by sending a prompt that includes the original input, the LLM's response, and an instruction to critique or score it.
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Define the original prompt and the LLM output to evaluate
original_prompt = "Explain the benefits of renewable energy."
llm_output = "Renewable energy is good because it is clean and sustainable."
evaluation_instruction = (
    "You are an expert AI evaluator. Please rate the LLM output on accuracy, completeness, and clarity from 1 to 10, "
    "and provide a brief explanation for the score."
)
# Construct the message for Claude
messages = [
    {"role": "user", "content": (
        f"Original prompt: {original_prompt}\n"
        f"LLM output: {llm_output}\n"
        f"Instruction: {evaluation_instruction}"
    )}
]
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system="You are a helpful assistant that evaluates AI-generated text.",
    messages=messages
)
# The Anthropic SDK returns content blocks, not OpenAI-style choices
evaluation = response.content[0].text
print("Evaluation result:\n", evaluation)
Output
Evaluation result:
Score: 8/10
Explanation: The output correctly identifies renewable energy as clean and sustainable, which is accurate. However, it lacks detail on specific benefits such as environmental impact, economic advantages, or energy security, so completeness is limited. The clarity is good but could be improved with more elaboration.
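The critique above is free text; if you need a numeric score for downstream comparison or aggregation, a small parser can pull it out. A minimal sketch, assuming the model follows the "Score: N/10" convention requested in the instruction (extract_score is a hypothetical helper, not part of the SDK):

```python
import re
from typing import Optional

def extract_score(evaluation_text: str) -> Optional[int]:
    # Look for a pattern like "Score: 8/10" anywhere in the critique
    match = re.search(r"(\d+)\s*/\s*10", evaluation_text)
    return int(match.group(1)) if match else None

sample = "Score: 8/10 Explanation: The output is accurate but brief."
print(extract_score(sample))  # 8
```

Models do not always follow the requested format exactly, so handle the None case (for example, by re-prompting or logging the raw critique).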
Common variations
You can customize the evaluation by:
- Using different Claude models like claude-sonnet-4-5 for more advanced reasoning.
- Adjusting max_tokens for longer or shorter evaluations.
- Running evaluations asynchronously or integrating into pipelines.
import asyncio
import os
import anthropic

async def async_evaluate():
    # Use AsyncAnthropic; messages.create is awaitable on the async client
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    messages = [{"role": "user", "content": "Evaluate this output..."}]
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system="You are an expert evaluator.",
        messages=messages
    )
    print(response.content[0].text)

asyncio.run(async_evaluate())
Output
Detailed evaluation text printed asynchronously.
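When integrating into a pipeline, it helps to separate payload construction from the API call so the same prompt template serves batch jobs. A sketch, assuming a hypothetical build_eval_messages helper that mirrors the message format used in the step-by-step example:

```python
def build_eval_messages(original_prompt, llm_output, instruction):
    # Assemble the single-turn payload passed as `messages` to client.messages.create
    return [{
        "role": "user",
        "content": (
            f"Original prompt: {original_prompt}\n"
            f"LLM output: {llm_output}\n"
            f"Instruction: {instruction}"
        ),
    }]

# Evaluate several outputs with one shared instruction
pairs = [
    ("Explain photosynthesis.", "Plants convert sunlight into chemical energy."),
    ("Define recursion.", "A function that calls itself until a base case."),
]
instruction = "Rate the output from 1 to 10 and explain briefly."
payloads = [build_eval_messages(p, o, instruction) for p, o in pairs]
print(len(payloads))  # 2
```

Each payload can then be sent with client.messages.create, or gathered concurrently with the async client.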
Troubleshooting
- If you get authentication errors, verify your ANTHROPIC_API_KEY environment variable is set correctly.
- If the response is cut off, increase max_tokens.
- For unclear evaluations, refine your instruction prompt to be more specific.
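Transient API errors (rate limits, overloaded servers) are also worth handling; a retry-with-backoff wrapper is a common pattern. A minimal sketch with a stub standing in for the real API call (in practice you would catch the SDK's specific error types, such as anthropic.RateLimitError, rather than bare Exception):

```python
import time

def evaluate_with_retry(call, max_attempts=3, base_delay=1.0):
    # `call` is any zero-argument function that performs the API request;
    # retry with exponential backoff on failure, re-raising on the last attempt.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demonstrate with a stub that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(evaluate_with_retry(flaky, base_delay=0.01))  # ok
```

To use it with the real client, wrap the request in a closure, e.g. evaluate_with_retry(lambda: client.messages.create(...)).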
Key Takeaways
- Use Claude models via the Anthropic API to prompt for evaluation of LLM outputs by including the original prompt, output, and evaluation instructions.
- Customize evaluation detail and length by adjusting model choice and max_tokens parameters.
- Handle errors by checking API keys and refining prompts for clearer, more useful feedback.