How-to · Intermediate · 4 min read

How to A/B test prompts in LangSmith

Quick answer
Use the LangSmith Python SDK to create multiple prompt variants as separate runs or experiments, then compare their outputs and metrics programmatically. Leverage LangSmith's tracing and project features to organize, track, and analyze A/B test results efficiently.

PREREQUISITES

  • Python 3.8+
  • pip install langsmith langchain-openai
  • LangSmith API key set in the LANGSMITH_API_KEY environment variable, with LANGSMITH_TRACING=true to enable tracing
  • OpenAI API key set in the OPENAI_API_KEY environment variable (used by ChatOpenAI)

Setup

Install the langsmith and langchain-openai packages, then set your API keys and enable tracing so that @traceable-decorated functions are recorded in LangSmith.

bash
pip install langsmith langchain-openai
export LANGSMITH_API_KEY="<your-langsmith-key>"
export LANGSMITH_TRACING=true
export OPENAI_API_KEY="<your-openai-key>"

Step by step

This example demonstrates how to create two prompt variants, run them through a LangSmith-traced LLM client, and compare their outputs for A/B testing.

python
import os
from langsmith import Client, traceable
from langchain_openai import ChatOpenAI

# Initialize LangSmith client
client = Client(api_key=os.environ["LANGSMITH_API_KEY"])

# Initialize the OpenAI chat model (calls are traced via @traceable below)
llm = ChatOpenAI(model="gpt-4o-mini")

@traceable()
def run_prompt(prompt: str) -> str:
    response = llm.invoke([{"role": "user", "content": prompt}])
    return response.content

# Define two prompt variants for A/B testing
prompt_a = "Explain the benefits of A/B testing in AI prompt engineering."
prompt_b = "Describe why prompt A/B testing improves AI model performance."

# Run both prompts
result_a = run_prompt(prompt_a)
result_b = run_prompt(prompt_b)

print("Prompt A output:\n", result_a)
print("\nPrompt B output:\n", result_b)

# The @traceable decorator records each call as a run in LangSmith
# automatically (when LANGSMITH_TRACING=true), so no manual logging is
# needed. To inspect the logged runs from code, fetch the most recent ones:
for run in client.list_runs(project_name="default", limit=2):
    print(run.name, run.inputs)
output
Prompt A output:
 A/B testing in AI prompt engineering helps identify the most effective prompts by comparing different versions, leading to improved model responses and user satisfaction.

Prompt B output:
 Prompt A/B testing improves AI model performance by systematically evaluating prompt variations to find the best phrasing that yields accurate and relevant outputs.
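
Once both outputs are in hand, you can compare them programmatically rather than by eye. The helper below is a minimal, LangSmith-independent sketch: it scores a response by word count and by coverage of an illustrative keyword list (the function name, the keywords, and the scoring scheme are assumptions for this example, not part of the LangSmith SDK). Apply it to result_a and result_b from the step above.

```python
def score_response(text: str, keywords: list[str]) -> dict:
    """Score a response by word count and keyword coverage (0.0-1.0)."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return {
        "word_count": len(text.split()),
        "keyword_coverage": hits / len(keywords) if keywords else 0.0,
    }

# Illustrative keywords a good answer might mention; adjust to your use case
keywords = ["testing", "prompt", "compare"]

# In the full example, pass result_a and result_b here instead
sample_a = "A/B testing compares prompt variants."
sample_b = "Evaluate each prompt systematically."

for name, text in [("A", sample_a), ("B", sample_b)]:
    print(name, score_response(text, keywords))
```

Simple lexical metrics like these are only a starting point; for production comparisons you would typically attach human or LLM-as-judge feedback scores to the runs in LangSmith instead.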

Common variations

You can run A/B tests asynchronously or with streaming responses by adapting the LangSmith tracing decorators and LangChain model calls. Also, test different models or temperature settings by changing the ChatOpenAI initialization.

python
import asyncio
import os
from langsmith import Client, traceable
from langchain_openai import ChatOpenAI

client = Client(api_key=os.environ["LANGSMITH_API_KEY"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

@traceable()
async def run_prompt_async(prompt: str) -> str:
    # ChatOpenAI exposes async invocation via ainvoke
    response = await llm.ainvoke([{"role": "user", "content": prompt}])
    return response.content

async def main():
    prompt_a = "Explain A/B testing benefits in AI prompt engineering."
    prompt_b = "Describe why prompt A/B testing improves AI model performance."

    result_a, result_b = await asyncio.gather(
        run_prompt_async(prompt_a),
        run_prompt_async(prompt_b)
    )

    print("Prompt A output:\n", result_a)
    print("\nPrompt B output:\n", result_b)

asyncio.run(main())
output
Prompt A output:
 A/B testing in AI prompt engineering helps identify the most effective prompts by comparing different versions, leading to improved model responses and user satisfaction.

Prompt B output:
 Prompt A/B testing improves AI model performance by systematically evaluating prompt variations to find the best phrasing that yields accurate and relevant outputs.
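
To test different models or temperature settings, it helps to treat each (model, temperature) combination as a named variant. The sketch below only builds that variant grid; the naming scheme and build_variants helper are hypothetical, and each resulting config would be unpacked into ChatOpenAI(model=..., temperature=...) as in the examples above.

```python
from itertools import product

def build_variants(models: list[str], temperatures: list[float]) -> list[dict]:
    """Build one named config dict per (model, temperature) combination."""
    return [
        {"name": f"{m}-t{t}", "model": m, "temperature": t}
        for m, t in product(models, temperatures)
    ]

# Illustrative grid; adjust to the models and settings you want to compare
variants = build_variants(["gpt-4o-mini", "gpt-4o"], [0.0, 0.7])

for v in variants:
    # Each config maps onto ChatOpenAI(model=v["model"], temperature=v["temperature"])
    print(v["name"])
```

Naming each variant up front means the run names in LangSmith directly identify which configuration produced which output, which keeps multi-way comparisons readable.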

Troubleshooting

  • If you see authentication errors, verify your LANGSMITH_API_KEY environment variable is set correctly.
  • If runs do not appear in LangSmith, ensure LANGSMITH_TRACING is set to true; @traceable only records runs when tracing is enabled.
  • For network issues, check your internet connection and LangSmith service status.
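
A quick pre-flight check catches missing credentials before you spend a run discovering them. This is a generic sketch; the variable list matches the setup in this article, so adjust it to your environment.

```python
import os

def missing_env_vars(required: list[str]) -> list[str]:
    """Return the required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

required = ["LANGSMITH_API_KEY", "LANGSMITH_TRACING", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```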

Key Takeaways

  • Use LangSmith's Python SDK and tracing decorators to run and track prompt variants easily.
  • Log each prompt run with descriptive names and inputs/outputs for clear A/B test comparison.
  • Leverage async and streaming capabilities for efficient large-scale prompt testing.
  • Always verify your API key and tracing environment variables so data is captured in LangSmith.
Verified 2026-04 · gpt-4o-mini