How to evaluate pipelines in Haystack
Quick answer
In Haystack 2.x, evaluation is done with evaluator components such as AnswerExactMatchEvaluator, SASEvaluator, or DocumentMRREvaluator. Run your pipeline on labeled questions, then pass the ground-truth and predicted answers to an evaluator's run() method to get scores. (The pipeline.eval() method belongs to Haystack 1.x and does not exist in haystack-ai.)

PREREQUISITES
- Python 3.8+
- pip install "haystack-ai>=2.1" (the built-in evaluators shipped with the 2.1 release)
- OpenAI API key (free tier works) if using OpenAIGenerator

Setup
Install Haystack v2 and set your OpenAI API key as an environment variable for generator components.
pip install "haystack-ai>=2.1" openai

Step by step
Build a retrieval-augmented QA pipeline, run it on labeled questions, and score the predicted answers with an evaluator component. This example evaluates a QA pipeline built from OpenAIGenerator and InMemoryBM25Retriever.
import os

from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.evaluators import AnswerExactMatchEvaluator
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# OpenAIGenerator reads OPENAI_API_KEY from the environment by default
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running"

# Initialize document store and add documents
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(id="1", content="Paris is the capital of France.")])

# Initialize retriever, prompt builder, and generator
retriever = InMemoryBM25Retriever(document_store=document_store)
prompt_builder = PromptBuilder(template="""Answer the question using the context.
Context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}
Answer:""")
generator = OpenAIGenerator(model="gpt-4o-mini")

# Build pipeline
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("generator", generator)
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "generator.prompt")

# Prepare evaluation data
questions = ["What is the capital of France?"]
ground_truth_answers = ["Paris"]

# Run the pipeline to collect predictions
predicted_answers = []
for question in questions:
    result = pipeline.run({
        "retriever": {"query": question, "top_k": 1},
        "prompt_builder": {"question": question},
    })
    predicted_answers.append(result["generator"]["replies"][0])

# Evaluate predictions against the labels
evaluator = AnswerExactMatchEvaluator()
results = evaluator.run(ground_truth_answers=ground_truth_answers,
                        predicted_answers=predicted_answers)
print("Evaluation results:", results)

Output
Evaluation results: {'individual_scores': [1], 'score': 1.0}
Note that exact match requires the model's reply to equal the label string exactly; for fuzzy scoring, use SASEvaluator (semantic answer similarity) instead.

Common variations
- Use DocumentMAPEvaluator or DocumentMRREvaluator for retrieval-only evaluation with metrics like MAP and MRR.
- Evaluate generative pipelines with different models such as gpt-4o, or swap in a generator from an integration package (e.g. Anthropic's claude-3-5-sonnet-20241022 via the anthropic-haystack integration).
- Speed up large evaluations by batching pipeline runs and scoring all predictions in a single evaluator call.
Troubleshooting
- If evaluation returns zero scores, verify your labeled data format matches Haystack's expected schema.
- Ensure your retriever returns relevant documents by testing retrieval separately before full pipeline evaluation.
- Check API key environment variables if generator calls fail with authentication errors.
Key Takeaways
- Run your pipeline on labeled data and score its outputs with evaluator components to measure performance in Haystack 2.x.
- Choose evaluators based on pipeline type: document evaluators for retrieval-only pipelines, answer evaluators for generative ones.
- Validate your dataset format and API keys to avoid common evaluation errors.