How to use LLM as judge in LangSmith
Quick answer
Use the LangSmith Python SDK to create a judge function that evaluates AI outputs by invoking an LLM such as gpt-4o. Define the judge logic, then run it on LangSmith traces to automatically assess and score model responses.
Prerequisites
- Python 3.8+
- OpenAI API key (set as OPENAI_API_KEY)
- LangSmith API key (set as LANGSMITH_API_KEY)
- pip install langsmith openai
Setup
Install the required packages and set environment variables for OPENAI_API_KEY and LANGSMITH_API_KEY. This enables authenticated access to OpenAI models and LangSmith tracing services.
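Before making any API calls, a quick sanity check can confirm both variables are visible to your process. The `check_env` helper below is a hypothetical convenience, not part of either SDK:

```python
import os

# Hypothetical helper: fail fast if a required environment variable is missing
def check_env(required=("OPENAI_API_KEY", "LANGSMITH_API_KEY")):
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return True
```

Calling `check_env()` raises immediately with the names of any unset keys, which is easier to debug than an authentication error deep inside a request.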
pip install langsmith openai
Step by step
This example shows how to define a judge function using LangSmith's Client and OpenAI's OpenAI client. The judge uses an LLM to score AI-generated responses based on correctness or quality.
```python
import os
import uuid

from langsmith import Client
from openai import OpenAI

# Initialize LangSmith client with API key
langsmith_client = Client(api_key=os.environ["LANGSMITH_API_KEY"])

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define a judge function that uses an LLM to evaluate outputs
def llm_judge(prompt: str, response: str) -> str:
    messages = [
        {"role": "system", "content": "You are a judge that scores the quality of the response from 1 to 5."},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}\nScore the response from 1 (poor) to 5 (excellent) with a short explanation."},
    ]
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100,
    )
    return completion.choices[0].message.content

# Example usage
prompt_text = "What is LangSmith used for?"
model_response = "LangSmith is a platform for tracing and evaluating AI model outputs."

# Run judge
score = llm_judge(prompt_text, model_response)
print("Judge output:\n", score)

# Optionally, log the judge result to LangSmith as a run via Client.create_run
run_id = uuid.uuid4()
langsmith_client.create_run(
    name="llm_judge",
    run_type="chain",
    id=run_id,
    inputs={"prompt": prompt_text},
    outputs={"response": model_response, "judge": score},
)
print(f"Run logged with ID: {run_id}")
```
Output
Judge output:
 Score: 5 Explanation: The response correctly and concisely explains LangSmith's purpose.
Run logged with ID: <some-run-id>
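The judge above replies in free text, so downstream code usually needs to extract a numeric score before aggregating results. A minimal sketch of that step (the `parse_score` helper and the "Score: N" reply pattern are assumptions about the judge's output format, not part of any SDK):

```python
import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[int]:
    # Prefer an explicit "Score: N" pattern; fall back to the first
    # standalone digit in the 1-5 range anywhere in the reply.
    match = re.search(r"Score:\s*([1-5])", judge_output) or re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

print(parse_score("Score: 5 Explanation: clear and correct"))  # 5
print(parse_score("no score here"))  # None
```

Returning `None` when no score is found lets callers distinguish "judge refused or rambled" from a genuine low score.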
Common variations
You can adapt the judge to use different LLMs like claude-3-5-haiku-20241022 or run asynchronously. Also, customize the scoring criteria or output format to fit your evaluation needs.
```python
import asyncio
import os

from openai import AsyncOpenAI

# Use the async OpenAI client so chat completions can be awaited
async_openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_llm_judge(prompt: str, response: str) -> str:
    messages = [
        {"role": "system", "content": "You are a judge scoring from 1 to 5."},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}\nScore with explanation."},
    ]
    completion = await async_openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100,
    )
    return completion.choices[0].message.content

# Run async judge
async def main():
    score = await async_llm_judge("What is AI?", "AI stands for Artificial Intelligence.")
    print("Async judge output:\n", score)

asyncio.run(main())
```
Output
Async judge output:
 Score: 5 Explanation: Accurate and concise definition of AI.
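Customizing the scoring criteria can be as simple as parameterizing the prompt. The `build_judge_messages` helper below is an illustrative sketch (not part of any SDK) that lets the same judge score different criteria or ranges; its return value plugs straight into the `messages` argument of a chat completion call:

```python
from typing import Dict, List, Tuple

def build_judge_messages(prompt: str, response: str,
                         criteria: str = "correctness",
                         scale: Tuple[int, int] = (1, 5)) -> List[Dict[str, str]]:
    # Illustrative helper: bake the rubric into the system prompt so the
    # judge's criteria and scale can vary per evaluation.
    low, high = scale
    system = (
        f"You are a judge. Score the response for {criteria} "
        f"from {low} (poor) to {high} (excellent). Reply as 'Score: N - explanation'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}"},
    ]

messages = build_judge_messages("What is AI?", "Artificial Intelligence.", criteria="conciseness")
print(messages[0]["content"])
```

Pinning the reply format ("Score: N - explanation") in the system prompt also makes the judge's output easier to parse programmatically.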
Troubleshooting
- If you get authentication errors, verify that your OPENAI_API_KEY and LANGSMITH_API_KEY environment variables are set correctly.
- If the judge returns irrelevant scores, refine the system prompt to clarify the scoring criteria.
- For rate limits, consider batching evaluations or using a smaller model.
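Batching evaluations can be sketched with asyncio.gather, capping how many judge calls run concurrently. In this sketch `judge_stub` is a placeholder standing in for a real async LLM judge such as `async_llm_judge` above:

```python
import asyncio
from typing import List, Tuple

async def judge_stub(prompt: str, response: str) -> str:
    # Placeholder for an actual async LLM judge call
    await asyncio.sleep(0)
    return "Score: 4"

async def judge_in_batches(pairs: List[Tuple[str, str]], batch_size: int = 5) -> List[str]:
    # Evaluate at most `batch_size` pairs concurrently to stay under rate limits
    results: List[str] = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        results += await asyncio.gather(*(judge_stub(p, r) for p, r in batch))
    return results

scores = asyncio.run(judge_in_batches([("q1", "a1"), ("q2", "a2"), ("q3", "a3")], batch_size=2))
print(scores)  # ['Score: 4', 'Score: 4', 'Score: 4']
```

Results come back in input order, so scores line up with the original (prompt, response) pairs.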
Key takeaways
- Use LangSmith's Client to log and trace AI outputs alongside judge scores.
- Invoke an LLM such as gpt-4o-mini to programmatically score responses with custom prompts.
- Customize judge logic and run synchronously or asynchronously depending on your app's needs.