How to use LLM as judge in LangSmith
Quick answer
Use the LangSmith Python SDK to create a judge function that evaluates AI outputs by invoking an LLM such as gpt-4o. Define the judge logic, then run it on LangSmith traces to automatically assess and score model responses.
Prerequisites
- Python 3.8+
- OpenAI API key (set as OPENAI_API_KEY)
- LangSmith API key (set as LANGSMITH_API_KEY)
- pip install langsmith openai
Setup
Install the required packages and set environment variables for OPENAI_API_KEY and LANGSMITH_API_KEY. This enables authenticated access to OpenAI models and LangSmith tracing services.
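Before making any API calls, a quick sanity check can confirm both variables are visible to your process. The `check_env` helper below is a hypothetical convenience, not part of either SDK:

```python
import os

# Hypothetical helper: fail fast if a required environment variable is missing
def check_env(required=("OPENAI_API_KEY", "LANGSMITH_API_KEY")):
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return True
```

Calling `check_env()` raises immediately with the names of any unset keys, which is easier to debug than an authentication error deep inside a request.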
pip install langsmith openai
Step by step
This example shows how to define a judge function using LangSmith's Client and OpenAI's OpenAI client. The judge uses an LLM to score AI-generated responses based on correctness or quality.
```python
import os
import uuid

from langsmith import Client
from openai import OpenAI

# Initialize LangSmith client with API key
langsmith_client = Client(api_key=os.environ["LANGSMITH_API_KEY"])

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Define a judge function that uses an LLM to evaluate outputs
def llm_judge(prompt: str, response: str) -> str:
    messages = [
        {"role": "system", "content": "You are a judge that scores the quality of the response from 1 to 5."},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}\nScore the response from 1 (poor) to 5 (excellent) with a short explanation."},
    ]
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100,
    )
    return completion.choices[0].message.content

# Example usage
prompt_text = "What is LangSmith used for?"
model_response = "LangSmith is a platform for tracing and evaluating AI model outputs."

# Run judge
score = llm_judge(prompt_text, model_response)
print("Judge output:\n", score)

# Optionally, log the judge result to LangSmith as a run via Client.create_run
run_id = uuid.uuid4()
langsmith_client.create_run(
    name="llm_judge",
    run_type="chain",
    id=run_id,
    inputs={"prompt": prompt_text},
    outputs={"response": model_response, "judge": score},
)
print(f"Run logged with ID: {run_id}")
```
Output
Judge output:
 Score: 5 Explanation: The response correctly and concisely explains LangSmith's purpose.
Run logged with ID: <some-run-id>
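The judge above replies in free text, so downstream code usually needs to extract a numeric score before aggregating results. A minimal sketch of that step (the `parse_score` helper and the "Score: N" reply pattern are assumptions about the judge's output format, not part of any SDK):

```python
import re
from typing import Optional

def parse_score(judge_output: str) -> Optional[int]:
    # Prefer an explicit "Score: N" pattern; fall back to the first
    # standalone digit in the 1-5 range anywhere in the reply.
    match = re.search(r"Score:\s*([1-5])", judge_output) or re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None

print(parse_score("Score: 5 Explanation: clear and correct"))  # 5
print(parse_score("no score here"))  # None
```

Returning `None` when no score is found lets callers distinguish "judge refused or rambled" from a genuine low score.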
Common variations
You can adapt the judge to use different LLMs like claude-3-5-haiku-20241022 or run asynchronously. Also, customize the scoring criteria or output format to fit your evaluation needs.
```python
import asyncio
import os

from openai import AsyncOpenAI

# Use the async OpenAI client so chat completions can be awaited
async_openai_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_llm_judge(prompt: str, response: str) -> str:
    messages = [
        {"role": "system", "content": "You are a judge scoring from 1 to 5."},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}\nScore with explanation."},
    ]
    completion = await async_openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=100,
    )
    return completion.choices[0].message.content

# Run async judge
async def main():
    score = await async_llm_judge("What is AI?", "AI stands for Artificial Intelligence.")
    print("Async judge output:\n", score)

asyncio.run(main())
```
Output
Async judge output:
 Score: 5 Explanation: Accurate and concise definition of AI.
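Customizing the scoring criteria can be as simple as parameterizing the prompt. The `build_judge_messages` helper below is an illustrative sketch (not part of any SDK) that lets the same judge score different criteria or ranges; its return value plugs straight into the `messages` argument of a chat completion call:

```python
from typing import Dict, List, Tuple

def build_judge_messages(prompt: str, response: str,
                         criteria: str = "correctness",
                         scale: Tuple[int, int] = (1, 5)) -> List[Dict[str, str]]:
    # Illustrative helper: bake the rubric into the system prompt so the
    # judge's criteria and scale can vary per evaluation.
    low, high = scale
    system = (
        f"You are a judge. Score the response for {criteria} "
        f"from {low} (poor) to {high} (excellent). Reply as 'Score: N - explanation'."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Prompt: {prompt}\nResponse: {response}"},
    ]

messages = build_judge_messages("What is AI?", "Artificial Intelligence.", criteria="conciseness")
print(messages[0]["content"])
```

Pinning the reply format ("Score: N - explanation") in the system prompt also makes the judge's output easier to parse programmatically.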
Troubleshooting
- If you get authentication errors, verify that your OPENAI_API_KEY and LANGSMITH_API_KEY environment variables are set correctly.
- If the judge returns irrelevant scores, refine the system prompt to clarify the scoring criteria.
- For rate limits, consider batching evaluations or using a smaller model.
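Batching evaluations can be sketched with asyncio.gather, capping how many judge calls run concurrently. In this sketch `judge_stub` is a placeholder standing in for a real async LLM judge such as `async_llm_judge` above:

```python
import asyncio
from typing import List, Tuple

async def judge_stub(prompt: str, response: str) -> str:
    # Placeholder for an actual async LLM judge call
    await asyncio.sleep(0)
    return "Score: 4"

async def judge_in_batches(pairs: List[Tuple[str, str]], batch_size: int = 5) -> List[str]:
    # Evaluate at most `batch_size` pairs concurrently to stay under rate limits
    results: List[str] = []
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        results += await asyncio.gather(*(judge_stub(p, r) for p, r in batch))
    return results

scores = asyncio.run(judge_in_batches([("q1", "a1"), ("q2", "a2"), ("q3", "a3")], batch_size=2))
print(scores)  # ['Score: 4', 'Score: 4', 'Score: 4']
```

Results come back in input order, so scores line up with the original (prompt, response) pairs.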
Key takeaways
- Use LangSmith's Client to log and trace AI outputs alongside judge scores.
- Invoke an LLM such as gpt-4o-mini to programmatically score responses with custom prompts.
- Customize judge logic and run synchronously or asynchronously depending on your app's needs.