How to use Claude to evaluate LLM outputs
Quick answer
Use Claude models via the Anthropic API to evaluate LLM outputs by prompting Claude to score or critique the generated text. Send the original prompt, the LLM output, and an evaluation instruction in the messages parameter to get a detailed assessment or score.
Prerequisites
- Python 3.8+
- Anthropic API key
- pip install anthropic>=0.20
Setup
Install the anthropic Python SDK and set your API key as an environment variable.
- Run pip install anthropic to install the SDK.
- Set your API key in your shell: export ANTHROPIC_API_KEY='your_api_key_here'.
Step by step
Use the claude-3-5-sonnet-20241022 model to evaluate an LLM output by sending a prompt that includes the original input, the LLM's response, and an instruction to critique or score it.
import os
import anthropic
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Define the original prompt and the LLM output to evaluate
original_prompt = "Explain the benefits of renewable energy."
llm_output = "Renewable energy is good because it is clean and sustainable."
evaluation_instruction = (
    "You are an expert AI evaluator. Please rate the LLM output on accuracy, completeness, and clarity from 1 to 10, "
    "and provide a brief explanation for the score."
)
# Construct the message for Claude
messages = [
    {"role": "user", "content": (
        f"Original prompt: {original_prompt}\n"
        f"LLM output: {llm_output}\n"
        f"Instruction: {evaluation_instruction}"
    )}
]
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    system="You are a helpful assistant that evaluates AI-generated text.",
    messages=messages
)
# The Anthropic SDK returns content blocks, not OpenAI-style choices
evaluation = response.content[0].text
print("Evaluation result:\n", evaluation)
Output
Evaluation result:
Score: 8/10
Explanation: The output correctly identifies renewable energy as clean and sustainable, which is accurate. However, it lacks detail on specific benefits such as environmental impact, economic advantages, or energy security, so completeness is limited. The clarity is good but could be improved with more elaboration.
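The critique above is free text; if you need a numeric score for downstream comparison or aggregation, a small parser can pull it out. A minimal sketch, assuming the model follows the "Score: N/10" convention requested in the instruction (extract_score is a hypothetical helper, not part of the SDK):

```python
import re
from typing import Optional

def extract_score(evaluation_text: str) -> Optional[int]:
    # Look for a pattern like "Score: 8/10" anywhere in the critique
    match = re.search(r"(\d+)\s*/\s*10", evaluation_text)
    return int(match.group(1)) if match else None

sample = "Score: 8/10 Explanation: The output is accurate but brief."
print(extract_score(sample))  # 8
```

Models do not always follow the requested format exactly, so handle the None case (for example, by re-prompting or logging the raw critique).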
Common variations
You can customize the evaluation by:
- Using different Claude models like claude-sonnet-4-5 for more advanced reasoning.
- Adjusting max_tokens for longer or shorter evaluations.
- Running evaluations asynchronously or integrating into pipelines.
import asyncio
import os
import anthropic

async def async_evaluate():
    # Use AsyncAnthropic; messages.create is awaitable on the async client
    client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    messages = [{"role": "user", "content": "Evaluate this output..."}]
    response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system="You are an expert evaluator.",
        messages=messages
    )
    print(response.content[0].text)

asyncio.run(async_evaluate())
Output
Detailed evaluation text printed asynchronously.
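When integrating into a pipeline, it helps to separate payload construction from the API call so the same prompt template serves batch jobs. A sketch, assuming a hypothetical build_eval_messages helper that mirrors the message format used in the step-by-step example:

```python
def build_eval_messages(original_prompt, llm_output, instruction):
    # Assemble the single-turn payload passed as `messages` to client.messages.create
    return [{
        "role": "user",
        "content": (
            f"Original prompt: {original_prompt}\n"
            f"LLM output: {llm_output}\n"
            f"Instruction: {instruction}"
        ),
    }]

# Evaluate several outputs with one shared instruction
pairs = [
    ("Explain photosynthesis.", "Plants convert sunlight into chemical energy."),
    ("Define recursion.", "A function that calls itself until a base case."),
]
instruction = "Rate the output from 1 to 10 and explain briefly."
payloads = [build_eval_messages(p, o, instruction) for p, o in pairs]
print(len(payloads))  # 2
```

Each payload can then be sent with client.messages.create, or gathered concurrently with the async client.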
Troubleshooting
- If you get authentication errors, verify your ANTHROPIC_API_KEY environment variable is set correctly.
- If the response is cut off, increase max_tokens.
- For unclear evaluations, refine your instruction prompt to be more specific.
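Transient API errors (rate limits, overloaded servers) are also worth handling; a retry-with-backoff wrapper is a common pattern. A minimal sketch with a stub standing in for the real API call (in practice you would catch the SDK's specific error types, such as anthropic.RateLimitError, rather than bare Exception):

```python
import time

def evaluate_with_retry(call, max_attempts=3, base_delay=1.0):
    # `call` is any zero-argument function that performs the API request;
    # retry with exponential backoff on failure, re-raising on the last attempt.
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demonstrate with a stub that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(evaluate_with_retry(flaky, base_delay=0.01))  # ok
```

To use it with the real client, wrap the request in a closure, e.g. evaluate_with_retry(lambda: client.messages.create(...)).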
Key Takeaways
- Use Claude models via the Anthropic API to prompt for evaluation of LLM outputs by including the original prompt, output, and evaluation instructions.
- Customize evaluation detail and length by adjusting model choice and max_tokens parameters.
- Handle errors by checking API keys and refining prompts for clearer, more useful feedback.