Code Intermediate medium · 7 min

LLMRerank: LLM-based reranking

What you will learn

Use an LLM to re-order retrieved documents by relevance before they are passed to the model that answers your query.

Why this matters

Retrieval often returns many documents, and vector similarity alone can rank loosely related text above the passages that actually answer the question. LLM-based reranking re-scores the candidates that were retrieved and filters noise before it reaches your context window, which can improve answer quality and shrink the final prompt. It cannot recover documents the retriever never returned, and the reranking calls consume tokens of their own.

Skip if: your vector retriever is already highly accurate (for example, 90%+ precision on your domain); the reranking overhead may not justify the extra LLM calls. Also avoid it for latency-critical systems that need sub-100ms responses: reranking adds roughly 500ms–2s per query.

Explanation

LLMRerank is a post-retrieval stage that takes the top-k documents from your vector store and asks an LLM to score them by relevance to your query. Instead of trusting vector similarity alone, you invoke the LLM as a judge: "Given this query and these documents, which are actually most relevant?" The LLM returns scores; you re-sort by those scores and keep only the top results.

Mechanically, LlamaIndex's LLMRerank class wraps a language model and plugs into the query engine as a node postprocessor. Documents flow through vector retrieval first (fast, broad), then through the LLM reranker (slow, precise), then into your response synthesizer. The reranker sends candidates to the LLM in batches of choice_batch_size and keeps the top_n highest-scoring ones.

This two-stage filtering works because embedding similarity compresses each passage into a single vector and compares it to the query vector, while the LLM reads the query and the passage together and can judge whether the passage actually answers the question. Use it when your retrieval returns too much noise or when precision matters more than speed.
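The score, sort, and truncate flow described above can be sketched in plain Python with no library at all. This is a minimal sketch, not LlamaIndex's implementation: `rerank` and `toy_judge` are hypothetical names, and `toy_judge` is a word-overlap stand-in for the LLM call so the example runs offline.

```python
from typing import Callable, List, Tuple

# A judge takes a query and a batch of documents and returns one score per document.
Judge = Callable[[str, List[str]], List[float]]


def toy_judge(query: str, batch: List[str]) -> List[float]:
    """Hypothetical stand-in for an LLM call: counts words shared with the query."""
    query_words = set(query.lower().split())
    return [float(len(query_words & set(doc.lower().split()))) for doc in batch]


def rerank(query: str, docs: List[str], judge: Judge,
           batch_size: int = 5, top_n: int = 3) -> List[Tuple[str, float]]:
    """Score every candidate in batches, then keep the top_n highest scores."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    scored: List[Tuple[str, float]] = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        scores = judge(query, batch)  # one "LLM call" per batch
        scored.extend(zip(batch, scores))
    # Stable sort: documents with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


candidates = [
    "stock markets rallied on tuesday",
    "rising sea levels are one of the main risks of climate change",
    "a recipe for sourdough bread",
    "climate change increases extreme weather risks",
]
top = rerank("main risks of climate change", candidates, toy_judge,
             batch_size=2, top_n=2)
for doc, score in top:
    print(f"{score:.0f}  {doc}")
```

Every candidate is scored exactly once; batching only controls how many documents share a single call. Swapping `toy_judge` for a function that prompts a real model would leave the surrounding logic unchanged.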

Analogy

Like hiring a senior editor to filter manuscripts. Your initial retriever is the automated scanner that pulls 100 plausible documents. The LLM reranker is the human expert who reads each one and says, 'Keep #3, #7, and #42; discard the rest.' The scanner is fast but crude; the expert is slower but usually more accurate.

Code

Illustrative only - not runnable without a valid API key

python

import os

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import LLMRerank
from llama_index.llms.openai import OpenAI

# Prefer exporting OPENAI_API_KEY in your shell; this placeholder only applies if it is unset.
os.environ.setdefault('OPENAI_API_KEY', 'your-api-key-here')

# One LLM instance serves both answer synthesis and reranking.
llm = OpenAI(model='gpt-4-turbo')
Settings.llm = llm

# Load every file under ./data and embed it into an in-memory vector index.
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

reranker = LLMRerank(
    choice_batch_size=5,  # nodes scored per LLM call
    top_n=3,              # nodes kept after reranking
    llm=llm,
)

query_engine = index.as_query_engine(
    similarity_top_k=10,  # candidates pulled by vector similarity
    node_postprocessors=[reranker],
)

response = query_engine.query('What are the main risks of climate change?')
print(f'Response: {response}')
print('\nSource documents after reranking:')
for node in response.source_nodes:
    print(f'  - {node.node.get_content()[:100]}...')

Output

Response: Climate change presents multiple risks including rising sea levels, extreme weather events, ecosystem disruption, agricultural yield reduction, and increased disease transmission.

Source documents after reranking:
  - Climate change poses severe risks to coastal communities, including rising sea levels that threaten...
  - Extreme weather events linked to climate change have already caused billions in damages and are...
  - Biodiversity loss and ecosystem collapse represent critical long-term risks as species struggle...

What just happened?

The code created a VectorStoreIndex from documents, then instantiated an LLMRerank postprocessor with top_n=3 (keep the top 3 after reranking) and choice_batch_size=5 (score 5 documents per LLM call). The query_engine was configured to retrieve 10 documents via vector similarity and pass those 10 to the reranker, which scored them with gpt-4-turbo in two batches of 5 and returned only the top 3. response.source_nodes therefore contains only the 3 reranked documents, not the full top 10 from the vector store.

Common gotcha

Developers assume LLMRerank is free. It isn't: the LLM has to read every document in the rerank pool. If you set similarity_top_k=50 with choice_batch_size=5, that is 10 extra LLM calls per query, and the text of all 50 documents is billed as input tokens. The cost trap runs in both directions: a large similarity_top_k inflates the reranking bill, and a large top_n ('keep the top 20') lets most of the noise through to the final prompt. As a starting point, set similarity_top_k to 2–3x top_n and let the reranker filter aggressively.
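A back-of-the-envelope estimate makes the cost scaling concrete. The sketch below assumes one LLM call per batch of choice_batch_size documents; `rerank_calls` and `rerank_input_tokens` are hypothetical helpers, and the 500 tokens per chunk and 200 tokens of prompt overhead per call are illustrative assumptions, not measured values.

```python
import math


def rerank_calls(similarity_top_k: int, choice_batch_size: int) -> int:
    """Number of LLM calls needed to score every retrieved document once."""
    if similarity_top_k < 0 or choice_batch_size <= 0:
        raise ValueError("similarity_top_k must be >= 0 and choice_batch_size > 0")
    return math.ceil(similarity_top_k / choice_batch_size)


def rerank_input_tokens(similarity_top_k: int, tokens_per_chunk: int,
                        choice_batch_size: int, prompt_overhead: int = 200) -> int:
    """Rough input-token estimate: each chunk is read once, plus overhead per call."""
    calls = rerank_calls(similarity_top_k, choice_batch_size)
    return similarity_top_k * tokens_per_chunk + calls * prompt_overhead


for k in (10, 50):
    calls = rerank_calls(k, choice_batch_size=5)
    tokens = rerank_input_tokens(k, tokens_per_chunk=500, choice_batch_size=5)
    print(f"similarity_top_k={k}: {calls} calls, ~{tokens} input tokens")
```

Under these assumptions, raising similarity_top_k from 10 to 50 multiplies both the call count and the input tokens by five, before the final answer is generated.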

Error recovery

ValueError: choice_batch_size must be > 0

You passed choice_batch_size=0 or a negative number. Set it to a positive integer, typically 5–10.

RateLimitError from OpenAI

The reranker is hitting rate limits because you're reranking too many documents per second across parallel queries. Reduce similarity_top_k or add a delay between queries.

AttributeError: 'LLMRerank' object has no attribute 'llm'

You're probably running a llama-index release older than 0.10.0, which predates the llm parameter on LLMRerank. Upgrade to llama-index-core 0.10.0 or later; the code in this lesson targets 0.12.x.
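For the rate-limit case, retrying with exponential backoff is a common alternative to a fixed delay between queries. This is a generic sketch rather than anything LlamaIndex provides: `with_backoff` and `FakeRateLimitError` are hypothetical names, and the fake error stands in for the provider's exception so the example runs offline.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T],
                 is_retryable: Callable[[Exception], bool],
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` with exponential backoff and jitter while errors are retryable."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(error):
                raise
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


class FakeRateLimitError(Exception):
    """Stand-in for the provider's rate-limit exception."""


attempts = {"count": 0}


def flaky_query() -> str:
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"


delays: List[float] = []  # record delays instead of actually sleeping
result = with_backoff(flaky_query,
                      lambda e: isinstance(e, FakeRateLimitError),
                      sleep=delays.append)
print(result, attempts["count"], [round(d, 1) for d in delays])
```

In a real pipeline, `call` would wrap something like `lambda: query_engine.query(question)` and `is_retryable` would check for the rate-limit exception your LLM client raises.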

Experienced dev note

Reranking is a force multiplier for retrieval quality, but measure twice before deploying. Run a small benchmark: compare answer quality with and without reranking on your actual queries. Often you'll find that a better embedding model (e.g., switching from text-embedding-3-small to text-embedding-3-large) recovers much of the reranking benefit at a fraction of the cost. Also, size choice_batch_size so that one batch of documents plus the prompt fits your token budget. As a rough guide, with chunks of about 200 tokens, a 2k-token budget fits around 10 documents per batch and an 8k budget around 30; larger chunks need smaller batches. Never rerank in a tight loop without caching or you'll burn through your budget.
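One way to run that benchmark is to label a handful of queries by hand and compare precision@k for both pipelines. The sketch below uses made-up toy data purely to show the mechanics; `precision_at_k` and `mean_precision` are hypothetical helpers, and the printed numbers say nothing about real reranking performance.

```python
from typing import Dict, List, Set


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the first k retrieved ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k


def mean_precision(runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int) -> float:
    """Average precision@k over all labeled queries."""
    return sum(precision_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Toy labels: query id -> ids of documents a human judged relevant.
labels = {"q1": {"a", "b"}, "q2": {"x"}}
# Toy ranked outputs from each pipeline for the same queries.
vector_only = {"q1": ["c", "a", "d"], "q2": ["y", "z", "x"]}
with_rerank = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}

print("vector only:", mean_precision(vector_only, labels, k=2))
print("with rerank:", mean_precision(with_rerank, labels, k=2))
```

With real data, the ranked ids would come from `response.source_nodes` in each configuration, and the gap between the two numbers is what you weigh against the added latency and cost.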

Check your understanding

You have 100 retrieved documents. Your reranker uses choice_batch_size=5 and top_n=3. Why would this configuration risk filtering out a highly relevant document, and what parameter would you change to reduce that risk?

Answer hint

A correct answer recognizes that choice_batch_size=5 does not skip anything: all 100 documents are scored, but in 20 separate LLM calls of 5 documents each. The risk is that each batch is judged in isolation, so scores from different batches are not calibrated against each other, and top_n=3 keeps only the three highest scores overall. A highly relevant document can lose out to documents that another batch happened to score more generously. To reduce that risk, increase choice_batch_size so more documents are compared within the same prompt, or raise top_n so borderline documents survive the cut.

VERSION In llama-index-core 0.10.0 and later, LLMRerank accepts an optional llm argument and falls back to the global Settings.llm when it is omitted. Releases before 0.10.0 predate Settings and configured the model through ServiceContext instead. The code above targets the 0.12.x API.

Once reranking is working, explore Fusion Retrieval to combine BM25 and semantic search before reranking, catching different document types that a single retriever misses.
