How-to · Intermediate · 3 min read

How to evaluate pipelines in Haystack

Quick answer
Use Haystack's evaluator components, such as AnswerExactMatchEvaluator for generated answers or DocumentMRREvaluator for retrieved documents. Run your pipeline on labeled queries, then pass the predictions and ground-truth labels to an evaluator's run() method to get metrics such as exact match, MAP, and MRR.

PREREQUISITES

  • Python 3.8+
  • pip install "haystack-ai>=2.0" (quote the requirement so the shell does not treat >= as a redirect)
  • OpenAI API key (free tier works) if using OpenAIGenerator

Setup

Install Haystack v2 and set your OpenAI API key as an environment variable for generator components.

bash
pip install "haystack-ai>=2.0"
export OPENAI_API_KEY="your-key-here"  # read by OpenAIGenerator at runtime

Step by step

Create a retrieval-augmented QA pipeline, run it on a labeled query, and score the prediction with an evaluator component. This example evaluates a pipeline built from InMemoryBM25Retriever, PromptBuilder, and OpenAIGenerator using AnswerExactMatchEvaluator.

python
import os

from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.evaluators import AnswerExactMatchEvaluator
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# OpenAIGenerator reads OPENAI_API_KEY from the environment by default
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running"

# Initialize the document store and add documents
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content="Paris is the capital of France.")])

# Build a RAG pipeline: retriever -> prompt builder -> generator
template = """Answer the question using only the context. Reply with the answer only.
Context:
{% for doc in documents %}{{ doc.content }}{% endfor %}
Question: {{ question }}"""

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store, top_k=1))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder", "generator")

# Labeled evaluation data
question = "What is the capital of France?"
ground_truth_answer = "Paris"

# Run the pipeline, then score the prediction against the label
result = pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})
predicted_answer = result["generator"]["replies"][0]

evaluator = AnswerExactMatchEvaluator()
scores = evaluator.run(ground_truth_answers=[ground_truth_answer], predicted_answers=[predicted_answer])
print("Exact match score:", scores["score"])
output
Exact match score: 1.0

The score depends on the model's reply; the prompt asks for the answer only, so an exact match with "Paris" is likely but not guaranteed.

Common variations

  • Use DocumentMAPEvaluator and DocumentMRREvaluator for retrieval-only evaluation with metrics like MAP and MRR.
  • Evaluate generative pipelines with different models such as gpt-4o, or claude-3-5-sonnet-20241022 via the anthropic-haystack integration.
  • Run large evaluation sets in batches and aggregate per-evaluator scores into a single report (e.g., with EvaluationRunResult from haystack.evaluation).

Troubleshooting

  • If evaluation returns zero scores, check that your ground-truth and predicted lists are parallel (same length, same order) and that answers match the expected format.
  • Ensure your retriever returns relevant documents by testing retrieval separately before full pipeline evaluation.
  • Check API key environment variables if generator calls fail with authentication errors.

Key Takeaways

  • Run evaluator components over labeled data to measure pipeline performance in Haystack.
  • Choose evaluators to match the pipeline stage: document-level evaluators for retrieval, answer-level evaluators for generation.
  • Validate your dataset format and API keys to avoid common evaluation errors.
Verified 2026-04 · gpt-4o-mini, gpt-4o, claude-3-5-sonnet-20241022