How to use BM25 retriever in LlamaIndex
Quick answer
Use the BM25Retriever class from llama_index to perform keyword-based document retrieval. Initialize it from a GPTVectorStoreIndex built over documents loaded with SimpleDirectoryReader, then call retrieve() with your query to get ranked results.
Prerequisites
- Python 3.8+
- pip install "llama-index>=0.6.0"
- pip install "openai>=1.0"
- OpenAI API key set in the environment variable OPENAI_API_KEY
Setup
Install the llama-index package and set your OpenAI API key in the environment. This example uses the BM25 retriever included in LlamaIndex for keyword-based search.
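For the API key part of this setup, a small guard can fail fast with a readable message instead of a later authentication error. This is a minimal sketch using only the standard library; the helper name require_env is hypothetical and not part of LlamaIndex.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the example")
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of a script stops it before any documents are loaded if the key is missing.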
pip install llama-index openai
Step by step
This example loads documents from a directory, builds an index, and uses the BM25Retriever to retrieve relevant documents for a query.
import os
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex
from llama_index.retrievers import BM25Retriever
# Load documents from a directory
documents = SimpleDirectoryReader('data').load_data()
# Build an index; BM25Retriever reads the parsed nodes from its docstore
index = GPTVectorStoreIndex.from_documents(documents)
# Initialize the BM25 retriever from the index
bm25_retriever = BM25Retriever.from_defaults(index=index)
# Query to retrieve documents
query = "What is the impact of climate change?"
# Retrieve top documents
results = bm25_retriever.retrieve(query)
# Print retrieved documents' text
for i, doc in enumerate(results):
    print(f"Document {i+1}:\n{doc.get_text()}\n")
Output
Document 1:
Climate change impacts include rising sea levels, extreme weather, and biodiversity loss.

Document 2:
The effects of climate change on agriculture are significant and require adaptation.
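To make "keyword-based" ranking concrete, here is a rough, from-scratch sketch of Okapi BM25 scoring. It is not LlamaIndex's implementation: the whitespace tokenizer, the function names, the defaults k1=1.5 and b=0.75, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace split; real retrievers use stronger tokenizers
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "climate change raises sea levels",
    "recipes for sourdough bread",
    "climate policy and agriculture",
]
scores = bm25_scores("climate change", docs)
# Print documents from best to worst match
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

A document that shares no terms with the query scores zero, which is why BM25 can miss relevant text that uses different wording.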
Common variations
- Use BM25Retriever with different index types like GPTSimpleVectorIndex.
- Adjust the number of retrieved documents by setting the retriever's top-k parameter.
- Combine BM25 with other retrievers for hybrid search strategies.
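For the hybrid option, one common way to merge results from BM25 and another retriever is reciprocal rank fusion. The sketch below is plain Python over lists of document ids and does not call any LlamaIndex API; the function name, the example ids, and the constant k=60 are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for ids it ranks near the top
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical rankings from a BM25 retriever and a vector retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because only ranks are used, the two retrievers' raw scores never need to be on the same scale.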
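# retrieve() takes only the query, so set the result count when building the retriever
bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=5)
results = bm25_retriever.retrieve(query)
Troubleshooting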
- If retrieval returns no results, ensure your documents are properly loaded and indexed.
- Check that the data directory contains readable text files.
- Verify your environment variable OPENAI_API_KEY is set correctly.
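The directory check can be scripted. This is a minimal sketch, assuming that "readable" means a non-empty file that decodes as UTF-8; the helper name readable_text_files is hypothetical and independent of SimpleDirectoryReader.

```python
from pathlib import Path


def readable_text_files(directory):
    """Return paths of non-empty files under `directory` that decode as UTF-8."""
    found = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        try:
            if path.read_text(encoding="utf-8").strip():
                found.append(path)
        except (UnicodeDecodeError, OSError):
            # Skip binary or unreadable files instead of failing the whole scan
            continue
    return found
```

If readable_text_files("data") returns an empty list, the loader has nothing useful to index, which would explain empty retrieval results.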
Key Takeaways
- Use BM25Retriever from llama_index.retrievers for keyword-based document retrieval.
- Initialize BM25Retriever with a vector store index built from your documents.
- Adjust retrieval parameters like similarity_top_k to control the number of results returned.
- Ensure documents are loaded correctly with SimpleDirectoryReader or similar loaders.
- Set your OpenAI API key in os.environ["OPENAI_API_KEY"] before running the code.