How to build an indexing pipeline in Haystack

Quick answer
Use Haystack to build an indexing pipeline by writing documents into an InMemoryDocumentStore and then querying them with a retriever such as InMemoryBM25Retriever. BM25 ranks documents by keyword overlap and needs no embedding model, so the setup is minimal; semantic search requires an embedding-based retriever instead.

PREREQUISITES

  • Python 3.8+
  • pip install haystack-ai openai
  • OpenAI API key (needed only for the generator example under Common variations)
  • Set environment variable OPENAI_API_KEY

Setup

Install the latest Haystack v2 package and set your OpenAI API key as an environment variable.

  • Run pip install haystack-ai openai to install dependencies.
  • Export your OpenAI API key in your shell: export OPENAI_API_KEY='your_key_here'.
bash
pip install haystack-ai openai
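
Before running the generator example, it can help to confirm that the key is actually visible to Python. The check below is a minimal sketch; the helper name has_api_key is hypothetical and not part of Haystack.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `name` is set to a non-empty value in `env`."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; the generator example will fail to authenticate.")
```

The BM25 retrieval example does not call OpenAI, so it runs even when this check prints the warning.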

Step by step

This example shows how to create an in-memory document store, write text documents to it (which also makes them searchable with BM25), and query them through a BM25 retriever in a pipeline.

python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Sample documents (Haystack v2 stores Document objects, not plain dicts)
docs = [
    Document(content="Haystack is an open source NLP framework.", meta={"name": "doc1"}),
    Document(content="It supports semantic search and question answering.", meta={"name": "doc2"}),
    Document(content="You can build pipelines easily with Haystack.", meta={"name": "doc3"}),
]

# Initialize document store
document_store = InMemoryDocumentStore()

# Write documents to the store; BM25 needs no separate indexing step
document_store.write_documents(docs)

# Initialize BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)

# Build a simple pipeline (v2 uses add_component, not add_node)
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)

# Query the pipeline; inputs are keyed by component name
query = "What is Haystack?"
result = pipeline.run({"retriever": {"query": query, "top_k": 2}})

print("Top documents:")
for doc in result["retriever"]["documents"]:
    print(f"- {doc.content}")
output
Top documents:
- Haystack is an open source NLP framework.
- You can build pipelines easily with Haystack.
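
To see why the two documents mentioning "Haystack" come out on top, here is a rough, self-contained sketch of BM25-style scoring. It is a simplified illustration and not Haystack's own implementation: the tokenizer, the k1 and b defaults, and the function names are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency is damped and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


texts = [
    "Haystack is an open source NLP framework.",
    "It supports semantic search and question answering.",
    "You can build pipelines easily with Haystack.",
]
scores = bm25_scores("What is Haystack?", texts)
for text, score in zip(texts, scores):
    print(f"{score:.3f}  {text}")
```

In this sketch the second document shares no terms with the query and scores zero, while the first scores highest because it matches both "is" and "haystack".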

Common variations

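You can extend the pipeline by adding a generator like OpenAIGenerator for answer generation, or switch to an embedding-based retriever such as InMemoryEmbeddingRetriever for semantic search. In Haystack v2 a generator takes a prompt rather than documents, so a PromptBuilder sits between the retriever and the generator. Async pipelines and streaming are also supported in Haystack v2.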

python
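from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

# Jinja2 prompt template that injects the retrieved documents
template = """Answer the question using only this context:
{% for doc in documents %}{{ doc.content }}
{% endfor %}
Question: {{ question }}"""

# New pipeline reusing document_store; the generator reads OPENAI_API_KEY from the environment
rag = Pipeline()
rag.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag.add_component("prompt_builder", PromptBuilder(template=template))
rag.add_component("generator", OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"max_tokens": 100}))
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "generator.prompt")

# Run pipeline with generation
query = "Explain Haystack framework"
result = rag.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"question": query}})

print("Generated answer:")
print(result["generator"]["replies"][0])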
output
Generated answer:
Haystack is an open source NLP framework that enables building pipelines for semantic search, question answering, and document retrieval.

Troubleshooting

  • If documents are not found, ensure they are correctly written to the DocumentStore before querying.
  • For large datasets, consider a persistent store such as ElasticsearchDocumentStore or QdrantDocumentStore instead of InMemoryDocumentStore. In Haystack v2 these ship as separate integration packages.
  • If you get API errors, verify your OpenAI API key is set correctly in os.environ["OPENAI_API_KEY"].

Key Takeaways

  • Use InMemoryDocumentStore and InMemoryBM25Retriever for quick indexing and retrieval in Haystack.
  • Index documents by writing them to the document store before querying the retriever.
  • Extend pipelines with generators like OpenAIGenerator for answer generation.
  • Switch to persistent document stores for large-scale or production use.
  • Always set your OpenAI API key in environment variables to avoid authentication errors.
Verified 2026-04 · gpt-4o-mini