Code Intermediate medium · 6 min

Hybrid retrieval: combining BM25 and vector

What you will learn

Hybrid retrieval combines keyword-based (BM25) and semantic (vector) search to get the best of both ranking strategies.

Why this matters

BM25 excels at exact keyword matching and is fast; vectors excel at semantic understanding. Real-world queries often need both: a search for 'Python DataFrame performance' should match exact keywords AND semantically similar content about data structure optimization.

Skip if: Don't use hybrid retrieval if your corpus is small (<1000 documents) where a single retrieval method suffices, or if your domain has zero synonymy (exact terminology only). Also skip it if latency is critical and you cannot afford dual-index overhead: a single index (BM25 or vector alone) avoids the second query and the fusion step.

Explanation

Hybrid retrieval combines two search paradigms: BM25 (keyword/lexical matching based on term frequency and document length) and vector search (semantic similarity via embeddings). Mechanically: the query is embedded into the same vector space as your documents; both BM25 and vector indices are queried in parallel; results are ranked by a combination strategy (often reciprocal rank fusion or weighted sum) to fuse the two ranked lists. When to use it: when your documents contain both structured terminology (code samples, API names) and domain narrative (explanations, discussions). The BM25 nodes catch the exact 'spark.DataFrame' lookup; the vector nodes catch 'distributed table abstraction.'
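To make the "term frequency and document length" part of BM25 concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not how any particular library implements it: whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and a smoothed IDF that stays positive. The function name `bm25_scores` and the three sample strings are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (illustrative)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms contribute more
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents are penalized
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    "spark dataframe api for distributed sql",
    "pandas dataframe tools",
    "semantic similarity of texts",
]
scores = bm25_scores("spark dataframe", docs)
```

The first document matches both query terms and scores highest, the second matches only "dataframe", and the third matches nothing and scores 0. A purely lexical scorer like this has no way to connect "distributed table abstraction" to a DataFrame, which is the gap the vector side fills.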

Analogy

Like a librarian using both a card catalog (BM25: exact subject headings) and a thesaurus/semantic network (vectors: related concepts). A researcher asking about 'machine learning' gets the exact shelf, plus nearby shelves with 'neural networks' and 'pattern recognition'.

Code

Illustrative only - not runnable without a valid API key

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
# BM25Retriever ships in the separate llama-index-retrievers-bm25 package
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
import os

# Set up API key
os.environ["OPENAI_API_KEY"] = "your-openai-key"

# Initialize settings
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Create sample documents
from llama_index.core.schema import Document

docs = [
    Document(text="Python DataFrame operations are fast with vectorized operations in NumPy."),
    Document(text="Pandas DataFrame provides high-performance data manipulation tools."),
    Document(text="The Apache Spark DataFrame API enables distributed SQL queries across clusters."),
    Document(text="Data structures in Python include lists, tuples, and dictionaries for storage."),
    Document(text="Semantic similarity measures how related two texts are in meaning."),
]

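# Parse documents into nodes once so both retrievers share the same node IDs
# (the RRF step below deduplicates by node ID)
from llama_index.core import StorageContext
from llama_index.core.node_parser import SentenceSplitter

nodes = SentenceSplitter().get_nodes_from_documents(docs)

# Initialize Chroma vector store (in-memory client)
client = chromadb.Client()
vector_store = ChromaVectorStore(chroma_collection=client.get_or_create_collection("hybrid_demo"))

# Create vector index; the vector store is passed in through a StorageContext
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)

# Create BM25 retriever over the same nodes
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=3)

# Create vector retriever
vector_retriever = vector_index.as_retriever(similarity_top_k=3)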

# Implement hybrid retrieval with reciprocal rank fusion
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle
from typing import List

class HybridRetriever(BaseRetriever):
    def __init__(self, bm25_retriever, vector_retriever):
        self.bm25_retriever = bm25_retriever
        self.vector_retriever = vector_retriever
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        # retrieve() wraps a plain query string in a QueryBundle before calling this method
        # Get results from both retrievers
        bm25_nodes = self.bm25_retriever.retrieve(query_bundle)
        vector_nodes = self.vector_retriever.retrieve(query_bundle)

        # Create a dictionary to store reciprocal rank fusion scores
        node_scores = {}
        k = 60  # Constant for RRF formula

        # Add BM25 scores
        for rank, node in enumerate(bm25_nodes):
            node_id = node.node_id
            rrf_score = 1.0 / (k + rank + 1)
            node_scores[node_id] = node_scores.get(node_id, 0) + rrf_score

        # Add vector scores
        for rank, node in enumerate(vector_nodes):
            node_id = node.node_id
            rrf_score = 1.0 / (k + rank + 1)
            node_scores[node_id] = node_scores.get(node_id, 0) + rrf_score

        # Create result list sorted by combined score
        result_nodes = []
        all_nodes = {node.node_id: node for node in bm25_nodes + vector_nodes}

        for node_id in sorted(node_scores.keys(), key=lambda x: node_scores[x], reverse=True):
            node = all_nodes[node_id]
            result_nodes.append(NodeWithScore(node=node.node, score=node_scores[node_id]))

        return result_nodes

# Test hybrid retrieval
hybrid_retriever = HybridRetriever(bm25_retriever, vector_retriever)
query = "DataFrame performance optimization"
results = hybrid_retriever.retrieve(query)

print(f"Hybrid retrieval results for '{query}':\n")
for i, result in enumerate(results, 1):
    print(f"{i}. Score: {result.score:.4f}")
    print(f"   Text: {result.node.get_content()[:70]}...\n")

Output

Hybrid retrieval results for 'DataFrame performance optimization':

1. Score: 0.0325
   Text: Pandas DataFrame provides high-performance data manipulat...

2. Score: 0.0325
   Text: Python DataFrame operations are fast with vectorized oper...

3. Score: 0.0317
   Text: The Apache Spark DataFrame API enables distributed SQL que...

What just happened?

The code created two independent retrievers (BM25 for keyword matching and vector for semantic matching), queried both with the same query string, then merged their ranked results using reciprocal rank fusion (RRF). Each retriever contributes 1/(k + rank) per document, with rank starting at 1, so documents that rank well in both lists accumulate the highest scores: with k = 60, a document that is top 1 in only one retriever scores 1/61 ≈ 0.0164, while a document in both retrievers' top 3 scores between 2/63 ≈ 0.0317 and 2/61 ≈ 0.0328. The final list is sorted by these combined scores.

Common gotcha

BM25 requires building an inverted index whose statistics (document frequencies, average document length) depend on the whole corpus, and the in-memory BM25 retriever does not update automatically when documents change: if you add docs after initialization, you must rebuild the BM25 retriever. Many developers forget this and get stale BM25 results. Vector indices typically support incremental inserts; this BM25 retriever does not.
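The staleness problem can be reproduced without any retrieval library. The sketch below uses a toy inverted index, `SnapshotKeywordIndex`, which is a hypothetical stand-in and not a LlamaIndex class; it only assumes that the index is built once from whatever documents exist at construction time.

```python
from collections import defaultdict

class SnapshotKeywordIndex:
    """Toy inverted index built once at construction (hypothetical stand-in)."""

    def __init__(self, docs):
        self.postings = defaultdict(set)
        for doc_id, text in enumerate(docs):
            for term in text.lower().split():
                self.postings[term].add(doc_id)

    def search(self, term):
        # Return the IDs of documents that contained the term at build time
        return sorted(self.postings.get(term.lower(), set()))

corpus = ["pandas dataframe tools", "spark dataframe api"]
index = SnapshotKeywordIndex(corpus)

corpus.append("polars dataframe engine")  # added after the index was built
stale_hits = index.search("polars")       # the snapshot never saw this document

index = SnapshotKeywordIndex(corpus)      # rebuild from the full corpus
fresh_hits = index.search("polars")
```

Here `stale_hits` is empty and `fresh_hits` contains the new document's ID. In a real pipeline the equivalent step is to rebuild the BM25 retriever whenever the node list changes.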

Error recovery

KeyError on node_id lookup

The fused node_scores dict contains a node ID that is missing from the all_nodes dict, typically because all_nodes was built from only one retriever's results. Fix: ensure all_nodes includes nodes from both retrievers: use `all_nodes = {node.node_id: node for node in bm25_nodes + vector_nodes}` before the loop.

chromadb.errors.NoCollectionsError

ChromaVectorStore requires an existing Chroma collection. Fix: create it with `client.get_or_create_collection("collection_name")` rather than `client.get_collection("collection_name")`, which raises an error when the collection does not exist yet.

AttributeError 'NodeWithScore' object has no attribute 'node_id'

NodeWithScore wraps the underlying node in its `.node` attribute. If your installed llama-index-core version does not expose `node_id` directly on NodeWithScore, use `node.node.node_id` for the ID when deduplicating.

Experienced dev note

Reciprocal rank fusion (RRF) is often more robust than simple score merging (e.g., weighted sum of normalized scores) because it doesn't require normalizing scores across completely different scoring schemes: BM25 scores have no inherent upper bound, while cosine similarity scores are bounded (between -1 and 1, and usually between 0 and 1 in practice). RRF treats ranking position as the currency and avoids the trap of one retriever dominating because it happens to produce larger raw scores. In production, tune the RRF constant k (commonly 60) against your own relevance data: lower k (20–40) widens the gap between top-ranked and lower-ranked results, while higher k (80–100) flattens it, giving tail results relatively more weight.
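A quick way to see what tuning k does is to compare how much more a rank-1 hit is worth than a rank-10 hit. The sketch below is illustrative only: the helper names `rrf_score` and `top_to_tail_ratio` are hypothetical, and ranks are 1-based here (the retriever code above uses a 0-based `rank` plus 1, which is equivalent).

```python
def rrf_score(rank, k=60):
    """RRF contribution of a single 1-based rank."""
    return 1.0 / (k + rank)

def top_to_tail_ratio(k, tail_rank=10):
    """How many times more a rank-1 hit is worth than a hit at tail_rank."""
    return rrf_score(1, k) / rrf_score(tail_rank, k)

ratios = {k: round(top_to_tail_ratio(k), 3) for k in (20, 60, 100)}
print(ratios)  # {20: 1.429, 60: 1.148, 100: 1.089}
```

With k = 20 a top hit is worth about 43% more than a rank-10 hit; with k = 100 the advantage shrinks to about 9%, so agreement between retrievers matters more than position within either list.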

Check your understanding

You query hybrid retrieval for 'cloud database scalability.' BM25 returns [DocA (rank 1), DocB (rank 2), DocC (rank 3)]. Vector search returns [DocB (rank 1), DocD (rank 2), DocA (rank 3)]. Using RRF with k=60, which document will have the highest combined score: A, B, C, or D? Why?

Show answer hint

Calculate the RRF score for each doc: A gets credit twice (rank 1 in BM25, rank 3 in vector), B gets credit twice (rank 2 in BM25, rank 1 in vector), C and D each once. DocB wins: it ranks first in one retriever (1/(60+0+1) ≈ 0.0164) and second in the other (1/(60+1+1) ≈ 0.0161), for ≈ 0.0325 total, narrowly ahead of DocA (1/61 + 1/63 ≈ 0.0323). The key insight is that RRF rewards documents that rank well in both retrievers over documents that appear in only one.
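The arithmetic can be checked with a few lines of plain Python. This is a minimal sketch that fuses lists of document IDs rather than LlamaIndex nodes; the function name `rrf_fuse` is hypothetical, and ranks are 1-based.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["DocA", "DocB", "DocC"]
vector_ranking = ["DocB", "DocD", "DocA"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# DocB: 0.0325
# DocA: 0.0323
# DocD: 0.0161
# DocC: 0.0159
```

DocB and DocA both appear in the two lists and finish well ahead of DocC and DocD, which each appear in only one.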

VERSION Since llama-index 0.10.0, core classes are imported from `llama_index.core` (for example `BaseRetriever` from `llama_index.core.retrievers` and `NodeWithScore` from `llama_index.core.schema`), and `BM25Retriever` ships in the separate `llama-index-retrievers-bm25` package. `List` comes from Python's `typing` module, not from LlamaIndex. If you add a `SimilarityPostprocessor`, set its `similarity_cutoff` parameter explicitly: do not omit it in production.

Next, explore query fusion and reranking: how to intelligently combine multi-stage retrievers and use LLM-based rerankers to push relevant documents higher in the final result list.
