How to run evaluations in LangSmith
Quick answer
Use the langsmith Python SDK: upload a dataset of inputs and reference outputs, define a target function that produces your model's outputs, write evaluator functions that score them, and call client.evaluate() (langsmith 0.2+) to run the experiment and get per-example results.
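In outline (a minimal sketch, assuming an existing dataset named "my-dataset" in your workspace and the client.evaluate() API of langsmith 0.2+; target here is a stand-in for your own model call):

    from langsmith import Client

    client = Client()  # reads LANGSMITH_API_KEY from the environment

    def target(inputs: dict) -> dict:
        # Replace with your real model call; "answer" is the output key scored below
        return {"answer": "4"}

    def exact_match(outputs: dict, reference_outputs: dict) -> bool:
        return outputs["answer"] == reference_outputs["answer"]

    results = client.evaluate(target, data="my-dataset", evaluators=[exact_match])

The full walkthrough below builds the dataset first, so it runs end to end.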
Prerequisites
- Python 3.8+
- pip install langsmith
- LANGSMITH_API_KEY environment variable set
Setup
Install the langsmith Python package and set your API key as an environment variable for authentication.
    pip install langsmith
    export LANGSMITH_API_KEY="<your-api-key>"

Step by step
This example initializes the LangSmith client, creates a small dataset of question/answer pairs, defines a target function and an exact-match accuracy evaluator, runs the evaluation with client.evaluate() (langsmith 0.2+), and prints each example's score.
    from langsmith import Client

    # Initialize the client; it reads LANGSMITH_API_KEY from the environment
    client = Client()

    # Create a dataset of inputs and reference outputs
    dataset = client.create_dataset("simple-accuracy-demo")
    client.create_examples(
        inputs=[
            {"question": "What is 2+2?"},
            {"question": "Capital of France?"},
            {"question": "Color of sky?"},
        ],
        outputs=[
            {"answer": "4"},
            {"answer": "Paris"},
            {"answer": "blue"},
        ],
        dataset_id=dataset.id,
    )

    # The system under test; replace the canned answers with your model call
    def target(inputs: dict) -> dict:
        canned = {"What is 2+2?": "4", "Capital of France?": "Paris", "Color of sky?": "Blue"}
        return {"answer": canned[inputs["question"]]}

    # An evaluator is a plain Python function; here, exact-match accuracy
    def accuracy(outputs: dict, reference_outputs: dict) -> dict:
        score = float(outputs["answer"] == reference_outputs["answer"])
        return {"key": "accuracy", "score": score}

    # Run the evaluation; results are recorded as an experiment in LangSmith
    results = client.evaluate(
        target,
        data=dataset.name,
        evaluators=[accuracy],
        experiment_prefix="simple-accuracy",
    )

    # Print per-example results
    for row in results:
        example = row["example"]
        print(f"Input: {example.inputs['question']}")
        print(f"Reference: {example.outputs['answer']}")
        for res in row["evaluation_results"]["results"]:
            print(f"{res.key}: {res.score}")
        print("---")

Output
    View the evaluation results for experiment: 'simple-accuracy-...' at:
    https://smith.langchain.com/...

    Input: What is 2+2?
    Reference: 4
    accuracy: 1.0
    ---
    Input: Capital of France?
    Reference: Paris
    accuracy: 1.0
    ---
    Input: Color of sky?
    Reference: blue
    accuracy: 0.0
    ---

The last example scores 0.0 because exact match is case-sensitive ("Blue" vs. "blue"); your experiment name and URL will differ.

Common variations
Because evaluators are ordinary Python functions, you can compute whatever metrics you need (case-insensitive accuracy, token-level F1, BLEU) and evaluate outputs from any model by swapping the target function; a sketch follows below. Evaluations can also run asynchronously with aevaluate(), shown in the example after that.
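For instance, here is what a case-insensitive accuracy evaluator (which would rescue the "Blue" vs. "blue" miss above) and a rough token-overlap F1 evaluator could look like. This is a sketch that reuses client, target, and dataset from the step-by-step example; the function names and metric implementations are illustrative, and only the evaluator signature and client.evaluate() call come from the SDK.

    def accuracy_ci(outputs: dict, reference_outputs: dict) -> dict:
        # Case-insensitive exact match, so "Blue" and "blue" count as equal
        match = outputs["answer"].strip().lower() == reference_outputs["answer"].strip().lower()
        return {"key": "accuracy_ci", "score": float(match)}

    def token_f1(outputs: dict, reference_outputs: dict) -> dict:
        # Rough token-overlap F1 between prediction and reference
        pred = set(outputs["answer"].lower().split())
        ref = set(reference_outputs["answer"].lower().split())
        overlap = len(pred & ref)
        if overlap == 0:
            return {"key": "f1", "score": 0.0}
        precision = overlap / len(pred)
        recall = overlap / len(ref)
        return {"key": "f1", "score": 2 * precision * recall / (precision + recall)}

    # Any number of evaluators can be passed to a single run
    results = client.evaluate(target, data=dataset.name, evaluators=[accuracy_ci, token_f1])

The asynchronous variant awaits aevaluate() with an async target and reuses the same dataset: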
    import asyncio
    from langsmith.evaluation import aevaluate

    async def async_target(inputs: dict) -> dict:
        # Stand-in for an awaited call to an async model client
        return {"answer": "Hi"}

    def exact_match(outputs: dict, reference_outputs: dict) -> bool:
        return outputs["answer"] == reference_outputs["answer"]

    async def main():
        # Reuses the dataset created in the step-by-step example
        await aevaluate(
            async_target,
            data="simple-accuracy-demo",
            evaluators=[exact_match],
        )

    asyncio.run(main())

Output
    View the evaluation results for experiment: '...' at:
    https://smith.langchain.com/...

Troubleshooting
- If you get authentication errors, verify that LANGSMITH_API_KEY is correctly set in your environment; see the check below.
- If metrics come back with unexpected scores, make sure the keys in your dataset examples match what your target and evaluators read (here, "question" and "answer").
- For network issues, check your internet connection and the LangSmith status page.
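A quick sanity check before debugging further (a sketch; listing datasets is used here simply as a convenient authenticated read that fails fast on a bad key):

    import os
    from langsmith import Client

    # 1. Is the key present at all?
    assert os.environ.get("LANGSMITH_API_KEY"), "LANGSMITH_API_KEY is not set"

    # 2. Does it authenticate? Any read call errors out on a bad key.
    client = Client()
    datasets = list(client.list_datasets())
    print(f"Auth OK: {len(datasets)} dataset(s) visible")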
Key Takeaways
- Use the official langsmith Python SDK to run evaluations programmatically.
- Provide a dataset of inputs and reference outputs, a target function, and evaluator functions that score each output.
- Async evaluation via aevaluate() is supported for scalable workflows.
- Always set LANGSMITH_API_KEY in your environment to authenticate.
- Keep dataset keys, target outputs, and evaluator signatures consistent to avoid metric calculation errors.