LLMRerank: LLM-based reranking
Why this matters
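Vector retrieval often returns many documents, and ranking by embedding similarity alone can miss semantic nuance. LLM-based reranking promotes relevant documents that similarity ranked too low and filters out noise before it reaches your context window. It cannot recover documents the vector store never retrieved. The payoff is better answer quality and fewer tokens sent to the final synthesis call, at the price of extra LLM calls for the reranking itself.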
Explanation
LLMRerank is a post-retrieval stage that takes the top-k documents from your vector store and asks an LLM to score them by relevance to your query. Instead of trusting vector similarity alone, you invoke the LLM as a judge: "Given this query and these documents, which are actually most relevant?" The LLM returns scores; you re-sort by those scores and keep only the top results.
Mechanically, LlamaIndex's LLMRerank class wraps a language model and plugs into the query pipeline as a node postprocessor. When you attach a reranker to a query engine, documents flow through vector retrieval first (fast, broad), then through the LLM reranker (slow, precise), then into your response synthesizer.
This two-stage filtering is powerful because vector similarity is an approximation: each document is compressed into a single embedding ahead of time and compared to the query by distance, whereas the LLM reads the query and the document text together. Use this when your retrieval is returning too much noise or when precision matters more than speed.
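The two-stage flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LlamaIndex's implementation: `vector_retrieve`, `rerank`, and `keyword_judge` are hypothetical names, the keyword-overlap judge stands in for a real LLM call, and the similarity scores in `corpus` are made-up toy values.

```python
from typing import Callable, List, Tuple

Doc = Tuple[str, float]  # (text, toy vector-similarity score)


def vector_retrieve(corpus: List[Doc], top_k: int) -> List[str]:
    """Stage 1 (fast, broad): keep the top_k documents by similarity score."""
    ranked = sorted(corpus, key=lambda d: d[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]


def rerank(query: str, docs: List[str], judge: Callable[[str, str], float],
           batch_size: int, top_n: int) -> List[str]:
    """Stage 2 (slow, precise): score every candidate in batches, keep top_n."""
    scored = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # A real reranker would send the whole batch to the LLM in one prompt.
        scored.extend((judge(query, doc), doc) for doc in batch)
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def keyword_judge(query: str, doc: str) -> float:
    """Hypothetical stand-in for the LLM judge: counts shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


corpus = [
    ("weather forecast for the weekend", 0.91),
    ("sea level rise is a major climate risk", 0.74),
    ("recipes for summer salads", 0.88),
    ("climate risk to crop yields", 0.70),
]
query = "climate risk"
candidates = vector_retrieve(corpus, top_k=4)
top_docs = rerank(query, candidates, keyword_judge, batch_size=2, top_n=2)
print(top_docs)
```

In this toy data the two on-topic documents have the lowest similarity scores, so stage 1 alone would rank them last; the judge in stage 2 moves them to the top.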
Analogy
Like hiring a senior editor to filter manuscripts. Your initial retriever is the automated scanner that pulls 100 plausible documents. The LLM reranker is the human expert who reads each one and says, 'Keep #3, #7, and #42; discard the rest.' The scanner was fast but crude; the expert is slower but more discerning.
Code
import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.postprocessor import LLMRerank
from llama_index.llms.openai import OpenAI

# Placeholder key; in practice, set OPENAI_API_KEY in your environment instead of hard-coding it.
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
Settings.llm = OpenAI(model='gpt-4-turbo')

# Build a vector index over the files in ./data
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

# Reranker: score candidates 5 at a time, keep the 3 best
reranker = LLMRerank(
    choice_batch_size=5,
    top_n=3,
    llm=OpenAI(model='gpt-4-turbo'),
)

# Retrieve 10 candidates by vector similarity, then pass them through the reranker
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

response = query_engine.query('What are the main risks of climate change?')
print(f'Response: {response}')
print('\nSource documents after reranking:')
for node in response.source_nodes:
    print(f'  - {node.node.get_content()[:100]}...')

Example output
Response: Climate change presents multiple risks including rising sea levels, extreme weather events, ecosystem disruption, agricultural yield reduction, and increased disease transmission.

Source documents after reranking:
  - Climate change poses severe risks to coastal communities, including rising sea levels that threaten...
  - Extreme weather events linked to climate change have already caused billions in damages and are...
  - Biodiversity loss and ecosystem collapse represent critical long-term risks as species struggle...
What just happened?
The code created a VectorStoreIndex from documents, then instantiated an LLMRerank postprocessor with top_n=3 (keep the top 3 after reranking) and choice_batch_size=5 (score 5 documents per LLM prompt). The query_engine was configured to retrieve 10 documents via vector similarity, then pass those 10 to the reranker, which scored them with gpt-4-turbo in two batches of 5 and returned only the top 3. As a result, response.source_nodes contains only the 3 reranked documents, not the full top 10 from the vector store.
Common gotcha
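Developers assume LLMRerank is free. It isn't: the LLM reads every document in the rerank pool. If you set similarity_top_k=50, all 50 documents are sent to the LLM on every query, split into ceil(50 / choice_batch_size) calls (10 calls at choice_batch_size=5). The cost trap is an oversized pool: reranking cost scales with similarity_top_k, not top_n. Setting top_n too high ('keep the top 20') is a separate mistake that defeats the purpose of filtering, because the noise still reaches your context window. Set similarity_top_k 2–3x higher than top_n and let the reranker filter aggressively.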
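A back-of-the-envelope estimate can make this cost concrete before you pick a pool size. The sketch below assumes the reranker makes one LLM call per batch of choice_batch_size documents and that every retrieved document is read once; `estimate_rerank_cost`, `avg_doc_tokens`, and `prompt_overhead_tokens` are hypothetical names and values for illustration, not part of LlamaIndex.

```python
import math


def estimate_rerank_cost(similarity_top_k: int, choice_batch_size: int,
                         avg_doc_tokens: int, prompt_overhead_tokens: int = 150) -> dict:
    """Rough per-query reranking cost (input tokens only, output ignored)."""
    # One LLM call per batch of documents (assumption stated in the lead-in).
    llm_calls = math.ceil(similarity_top_k / choice_batch_size)
    # Every document is read once; each call also repeats the prompt and query.
    input_tokens = similarity_top_k * avg_doc_tokens + llm_calls * prompt_overhead_tokens
    return {"llm_calls": llm_calls, "input_tokens": input_tokens}


small = estimate_rerank_cost(similarity_top_k=10, choice_batch_size=5, avg_doc_tokens=300)
large = estimate_rerank_cost(similarity_top_k=50, choice_batch_size=5, avg_doc_tokens=300)
print(small)  # {'llm_calls': 2, 'input_tokens': 3300}
print(large)  # {'llm_calls': 10, 'input_tokens': 16500}
```

Under these assumptions, growing the pool from 10 to 50 documents multiplies both the number of calls and the input tokens by five, regardless of what top_n is set to.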
Error recovery
ValueError: choice_batch_size must be > 0
RateLimitError from OpenAI
AttributeError: 'LLMRerank' object has no attribute 'llm'
Experienced dev note
Reranking is a force multiplier for retrieval quality, but measure before deploying. Run a small benchmark that compares answer quality with and without reranking on your actual queries. Often you'll find that a better embedding model (e.g., switching from text-embedding-3-small to text-embedding-3-large) gives you roughly 80% of the reranking benefit at about 1/10th the cost; treat that as a rule of thumb and verify it on your own data. Also, size choice_batch_size to your token budget: every batch prompt has to hold the query plus all the documents in that batch. As a rough guide for short chunks of 150 to 200 tokens, a 2k-token budget fits around 10 documents per batch and an 8k budget around 30; scale down for longer chunks. Never rerank in a tight loop without caching or you'll burn through your budget.
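One way to follow the caching advice is to memoize rerank results per query and candidate set. This is a minimal in-memory sketch: `CachedReranker` and `fake_llm_rerank` are hypothetical names, the fake function stands in for an expensive LLM rerank call, and the cache key assumes the rerank result does not depend on the order of the incoming candidates.

```python
from typing import Callable, Dict, List, Tuple


class CachedReranker:
    """Memoizes rerank results per (normalized query, candidate set)."""

    def __init__(self, rerank_fn: Callable[[str, List[str]], List[str]]):
        self._rerank_fn = rerank_fn
        self._cache: Dict[Tuple[str, Tuple[str, ...]], List[str]] = {}
        self.misses = 0  # number of times the underlying reranker was called

    def rerank(self, query: str, doc_ids: List[str]) -> List[str]:
        # Sorting the ids makes the key independent of candidate order.
        key = (query.strip().lower(), tuple(sorted(doc_ids)))
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._rerank_fn(query, doc_ids)
        return list(self._cache[key])  # copy so callers cannot mutate the cache


def fake_llm_rerank(query: str, doc_ids: List[str]) -> List[str]:
    """Hypothetical stand-in for an expensive LLM rerank call."""
    return sorted(doc_ids, reverse=True)[:2]


cached = CachedReranker(fake_llm_rerank)
first = cached.rerank("climate risk", ["d1", "d2", "d3"])
second = cached.rerank("Climate risk ", ["d3", "d1", "d2"])  # same key, served from cache
print(first, second, cached.misses)
```

A production cache would also need an eviction policy and invalidation when the index changes; both are left out here to keep the sketch short.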
Check your understanding
You have 100 retrieved documents. Your reranker uses choice_batch_size=5 and top_n=3. Why would this configuration risk filtering out a highly relevant document, and what parameter would you change to reduce that risk?
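Answer hint
A correct answer recognizes that choice_batch_size=5 does not skip any documents: all 100 are scored, in 20 separate prompts of 5 documents each. The risk is that the LLM judges each document only alongside the 4 others in its batch, so scores may not be consistent across batches, and with top_n=3 only three of the 100 survive. A highly relevant document that happens to receive a slightly lower score in its batch can be cut. To reduce this risk, increase choice_batch_size (within what the context window allows) so more documents are compared side by side, or raise top_n so the cut is less aggressive. Lowering similarity_top_k also helps, because a smaller pool means fewer batches to reconcile.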