How to use BM25 retriever in Haystack
Quick answer
Use the InMemoryDocumentStore with InMemoryBM25Retriever from haystack to perform BM25-based document retrieval. Load your documents into the store, initialize the retriever, and query it to get the most relevant documents.
Prerequisites
- Python 3.8+
- pip install "haystack-ai>=2.0" (quote the requirement so the shell does not treat > as a redirect)
- Basic knowledge of Python
Setup
Install the latest Haystack 2.x release (the haystack-ai package), which includes the InMemoryBM25Retriever. Ensure you have Python 3.8 or higher.
pip install haystack-ai
Step by step
This example shows how to create an InMemoryDocumentStore, add documents, initialize the InMemoryBM25Retriever, and query it for relevant documents.
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
# Initialize document store
document_store = InMemoryDocumentStore()
# Sample documents (Haystack 2.x expects Document objects, not plain dicts)
docs = [
    Document(content="Haystack is an open source NLP framework."),
    Document(content="BM25 is a ranking function used by search engines."),
    Document(content="Python is a popular programming language."),
]
# Write documents to the store
document_store.write_documents(docs)
# Initialize BM25 retriever
retriever = InMemoryBM25Retriever(document_store=document_store)
# Create a pipeline with the retriever (2.x uses add_component, not add_node)
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
# Query the retriever; inputs are keyed by component name
query = "What is BM25?"
result = pipeline.run({"retriever": {"query": query, "top_k": 2}})
# Print retrieved documents
for doc in result["retriever"]["documents"]:
    print(f"Score: {doc.score:.4f}, Content: {doc.content}")
Output
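Score: 1.0000, Content: BM25 is a ranking function used by search engines.
Score: 0.0000, Content: Haystack is an open source NLP framework.
The scores shown are illustrative. Actual values depend on the Haystack version and the retriever's BM25 settings, but the document that mentions BM25 should rank first.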
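To see why the document that mentions BM25 ranks first, it can help to score the sample documents by hand. The following is a minimal sketch of classic Okapi BM25 in pure Python. It is not Haystack's internal implementation, which may use a different BM25 variant and tokenizer; the function names, the simple regex tokenizer, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (always non-negative): rare terms weigh more
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, saturated by k1 and normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Haystack is an open source NLP framework.",
    "BM25 is a ranking function used by search engines.",
    "Python is a popular programming language.",
]
scores = bm25_scores("What is BM25?", docs)
ranked = sorted(zip(scores, docs), reverse=True)
for score, text in ranked:
    print(f"{score:.4f}  {text}")
```

The term "bm25" occurs in only one document, so it carries a high IDF weight, while "is" occurs in every document and contributes very little. That is why the second sample document comes out on top.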
Common variations
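- FAISSDocumentStore belongs to Haystack 1.x. In 2.x, vector search is handled by embedding retrievers such as InMemoryEmbeddingRetriever, which can be combined with BM25 for hybrid search.
- Adjust top_k to control the number of retrieved documents.
- Use the retriever in combination with a reader (ExtractiveReader in Haystack 2.x) for extractive QA pipelines.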
Troubleshooting
- If no documents are returned, ensure documents are correctly written to the InMemoryDocumentStore.
- Check that your query is a non-empty string.
- For large datasets, consider using a persistent document store instead of in-memory.
Key Takeaways
- Use InMemoryBM25Retriever with InMemoryDocumentStore for fast keyword-based retrieval.
- Write your documents to the document store before querying the retriever.
- Adjust top_k to control how many documents are returned per query.