Workflow Beginner easy · 6 min step

Reranking: cross-encoder second pass

What you will learn

Use a cross-encoder model to re-score and re-order retrieved documents, filtering out low-relevance results before passing to the LLM.

Step 3 of 5: After retrieval (Step 2), before LLM context assembly (Step 4). Sits between the embedding-based retriever and the prompt construction.

Why this matters

Embedding-based retrievers often return false positives: documents with high vector similarity that aren't actually relevant to the question. If you skip reranking, your LLM wastes tokens on noise and may hallucinate or give worse answers. With large candidate sets (100+ documents per query), even a small false-positive rate puts several irrelevant passages in front of the LLM, and every later step in the pipeline inherits them.

Explanation

What reranking does: After your embedding-based retriever returns (say) 10-20 candidate documents, a cross-encoder model re-scores each one using a direct relevance judgment. Cross-encoders are slower than embedding retrievers but much more accurate: they can reason about the relationship between query and document together, rather than comparing independent embeddings.

How to do it: Pass your original query + each retrieved document to a cross-encoder (like cross-encoder/ms-marco-MiniLM-L-6-v2). The model returns a score for each document. Keep only documents above a relevance threshold (for example 0.5-0.7 when scores are on a 0-1 scale; some checkpoints emit unbounded logits instead, so check the score range first), or keep the top-k after reranking. The surviving documents go to your LLM context window.

What to watch for: Cross-encoders are computationally expensive: they scale with document count, not embedding dimension. A 100-document retrieval takes ~100 forward passes through the model. Budget latency accordingly, or limit retrieval size before reranking. Also, the threshold you choose directly affects recall (too high = miss relevant docs) vs. precision (too low = include noise).
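
The recall/precision trade-off described above can be made concrete with a small sweep. The sketch below assumes a hand-labeled sample of (score, is_relevant) pairs; the numbers are hypothetical and only illustrate the shape of the trade-off, not the behaviour of any particular model.

```python
# Hypothetical labeled sample: (reranker score, is_relevant) for one query's candidates.
labeled = [
    (0.91, True), (0.74, True), (0.66, False), (0.58, True),
    (0.41, False), (0.33, True), (0.12, False), (0.05, False),
]

def precision_recall(labeled, threshold):
    """Precision and recall of the rule 'keep the document if score >= threshold'."""
    kept = [rel for score, rel in labeled if score >= threshold]
    total_relevant = sum(1 for _, rel in labeled if rel)
    true_pos = sum(kept)  # True counts as 1
    precision = true_pos / len(kept) if kept else 1.0  # nothing kept: no false positives
    recall = true_pos / total_relevant if total_relevant else 1.0
    return precision, recall

# Raising the threshold trades recall for precision.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(labeled, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this sample, moving the threshold from 0.3 to 0.7 drops the false positives but also loses half of the relevant documents, which is the pattern to look for on your own labeled data.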

Code

python

# pip install sentence-transformers

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the capital of France?"
documents = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Eiffel Tower is located in Paris.",
    "French cuisine includes cheese and wine.",
    "Paris hosted the 1900 and 2024 Olympics."
]

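# Score every (query, document) pair in one batched call.
# Note: depending on the checkpoint's config and the library version, predict()
# may return unbounded logits rather than 0-1 scores. Check the range before
# applying a 0-1 threshold.
scores = model.predict([(query, doc) for doc in documents])

print(f"Raw scores: {scores}")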

reranked_pairs = sorted(
    zip(documents, scores),
    key=lambda x: x[1],
    reverse=True
)

threshold = 0.5
# Keep (doc, score) pairs so the score can be printed alongside the text.
filtered = [(doc, score) for doc, score in reranked_pairs if score >= threshold]

print(f"\nDocuments above threshold {threshold}:")
for i, (doc, score) in enumerate(filtered):
    print(f"  {i+1}. [{score:.4f}] {doc}")

print("\nTop 3 by score (regardless of threshold):")
for i, (doc, score) in enumerate(reranked_pairs[:3]):
    print(f"  {i+1}. [{score:.4f}] {doc}")

Output (illustrative; exact values depend on the model and library version)

Raw scores: [0.8642 0.4156 0.6821 0.2103 0.5934]

Documents above threshold 0.5:
  1. [0.8642] Paris is the capital and largest city of France.
  2. [0.6821] The Eiffel Tower is located in Paris.
  3. [0.5934] Paris hosted the 1900 and 2024 Olympics.

Top 3 by score (regardless of threshold):
  1. [0.8642] Paris is the capital and largest city of France.
  2. [0.6821] The Eiffel Tower is located in Paris.
  3. [0.5934] Paris hosted the 1900 and 2024 Olympics.

Your options

Recommended

Hard threshold on cross-encoder score

When you need a consistent, interpretable filter. Good for compliance or explainability requirements.

Pros

Simple to understand; for a given model, a score above 0.65 always means the same thing. Reranking cost is fixed by the number of candidates you score.

Cons

Threshold may not generalize across different query types. Some queries have no documents above 0.65; others have 20.

ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)  # best first
reranked = [doc for doc, score in ranked if score >= 0.65]

Top-k reranking (keep best 5-10 documents)

When you want to preserve a fixed context window size or have a known budget for LLM tokens.

Pros

Predictable cost; you always pass exactly k documents to the LLM. Guarantees at least one result even if all scores are low.

Cons

If all top-k documents are mediocre, you still use them. Wastes LLM context. Doesn't adapt to query difficulty.

ranked_indices = np.argsort(scores)[::-1][:k]  # requires: import numpy as np
reranked = [docs[i] for i in ranked_indices]

Threshold + top-k hybrid

Production pipelines where you want flexibility: keep good documents, but never drop below a minimum or exceed a maximum.

Pros

Adapts to query difficulty while maintaining cost bounds. Best of both worlds.

Cons

More tuning required; two hyperparameters instead of one.

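ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)  # best first
kept = [doc for doc, score in ranked if score >= 0.5][:10]  # threshold, capped at 10
reranked = kept if len(kept) >= 2 else [doc for doc, _ in ranked[:2]]  # floor of 2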

Validation step

Before passing documents to the LLM, print the reranked score for each document alongside its text. Verify: (1) the top-scored document is actually relevant to your query (not a false positive), (2) at least one document passes your filter, (3) the order makes intuitive sense (most relevant first). If a document with score 0.15 appears above one with 0.75, the bug is in your own code rather than the model: the list was sorted ascending, or the scores got misaligned with their documents.
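
Checks (2) and (3) can be automated; check (1) still needs a human to read the top result. A minimal sketch, assuming the reranking step hands you a list of (document, score) pairs. The function name `validate_reranked` and its parameters are hypothetical.

```python
def validate_reranked(pairs, threshold, min_kept=1):
    """Return a list of problems found in reranked (doc, score) pairs; empty means OK."""
    problems = []
    scores = [score for _, score in pairs]

    # Check (3): results must be ordered best-first.
    if scores != sorted(scores, reverse=True):
        problems.append("results are not sorted by descending score")

    # Check (2): at least min_kept documents must pass the filter.
    kept = [doc for doc, score in pairs if score >= threshold]
    if len(kept) < min_kept:
        problems.append(f"only {len(kept)} document(s) at or above threshold {threshold}")

    return problems

pairs = [
    ("Paris is the capital and largest city of France.", 0.86),
    ("The Eiffel Tower is located in Paris.", 0.68),
    ("French cuisine includes cheese and wine.", 0.21),
]

# Check (1) is manual: print score next to text and read the top entries.
for doc, score in pairs:
    print(f"[{score:.2f}] {doc}")

print(validate_reranked(pairs, threshold=0.5) or "no problems found")
```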

At scale

At 10 documents per query, reranking adds roughly 50–100ms of latency. At 500 documents, expect 500–1000ms per query. Treat both as rough figures that vary with hardware and model size. If you're retrieving more than 50 documents, filter by embedding similarity first (keep the top 50) and rerank only that shortlist. Cross-encoder model size also matters: MiniLM (33M params) is fast but less accurate than a larger cross-encoder (on the order of 340M params). Test your choice on a realistic query distribution. At >1000 QPS, you may need GPU inference or batch reranking to stay under SLA.
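
The two-stage idea (cheap embedding filter first, expensive reranker second) can be sketched as follows. `prefilter_then_rerank` and `toy_scorer` are hypothetical names; `toy_scorer` is a word-overlap stand-in so the example runs without downloading a model. In a real pipeline you would pass a callable that scores (query, document) pairs with your cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

def prefilter_then_rerank(
    query: str,
    candidates: Sequence[Tuple[str, float]],  # (document, embedding similarity)
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    prefilter_k: int = 50,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Keep the prefilter_k most similar candidates, rerank only those, return the top final_k."""
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:prefilter_k]
    docs = [doc for doc, _ in shortlist]
    scores = score_pairs([(query, doc) for doc in docs])  # the expensive call
    ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
    return ranked[:final_k]

def toy_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in for a cross-encoder: fraction of query words that appear in the document."""
    out = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        out.append(len(q_words & d_words) / len(q_words))
    return out

# 200 hypothetical low-similarity fillers plus one clearly relevant document.
candidates = [(f"filler document {i}", 0.10 + i / 1000) for i in range(200)]
candidates.append(("paris is the capital of france", 0.95))

top = prefilter_then_rerank("capital of france", candidates, toy_scorer)
print(top[0])
```

Only 50 of the 201 candidates reach the scorer here, so reranking cost stays bounded no matter how many documents the retriever returns.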


Rollback plan

If reranking degrades quality (fewer relevant docs in final context), try one of the following: (1) lower your threshold or increase top-k to include more candidates, (2) switch to a more capable cross-encoder model (e.g., from MiniLM to a larger checkpoint), (3) verify your retriever is returning good initial candidates, since reranking cannot rescue a retriever that missed the right documents entirely, or (4) disable reranking and increase your embedding retriever's top-k instead (saves latency at the risk of lower quality).

Debug symptoms

Reranker is keeping 50+ documents per query, or keeping 0 documents on some queries

Diagnosis

Threshold is miscalibrated (too low or too high). Cross-encoder scores are not normalized the way you think they are.

Fix

Plot the score distribution across 50–100 representative queries. Pick a threshold that keeps 3–10 documents on 80% of queries. Derive it from the percentiles of the scores you actually observe rather than choosing an absolute value up front.
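
The calibration rule above (3–10 documents kept on 80% of queries) can be checked mechanically. A minimal sketch, assuming you have already collected one list of reranker scores per representative query; the scores and the helper names `coverage` and `pick_threshold` are hypothetical.

```python
def coverage(score_lists, threshold, lo=3, hi=10):
    """Fraction of queries for which the threshold keeps between lo and hi documents."""
    ok = 0
    for scores in score_lists:
        kept = sum(1 for s in scores if s >= threshold)
        if lo <= kept <= hi:
            ok += 1
    return ok / len(score_lists)

def pick_threshold(score_lists, candidates, target=0.8):
    """Highest candidate threshold whose coverage meets the target, or None if none does."""
    passing = [t for t in candidates if coverage(score_lists, t) >= target]
    return max(passing) if passing else None

# Hypothetical reranker scores for five queries (use 50-100 in practice).
score_lists = [
    [0.90, 0.80, 0.70, 0.60, 0.20, 0.10],
    [0.95, 0.85, 0.70, 0.66, 0.30],
    [0.80, 0.70, 0.68, 0.40, 0.30],
    [0.60, 0.55, 0.52, 0.20],
    [0.99, 0.90, 0.88, 0.86, 0.70, 0.50],
]

for t in (0.5, 0.65, 0.75):
    print(f"threshold={t}: coverage={coverage(score_lists, t):.0%}")

print("chosen:", pick_threshold(score_lists, [0.5, 0.65, 0.75]))
```

Picking the highest passing threshold favours precision. Pick the lowest instead if missing a relevant document costs you more than extra context tokens.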

Reranker latency is higher than expected, or GPU/CPU is maxed out

Diagnosis

You're reranking too many documents. Reranking cost grows linearly with the number of query-document pairs you score.

Fix

Retrieve fewer initial candidates (e.g., top 50 instead of 100 from embedding retriever) before reranking. Or batch rerank across multiple queries in parallel.

Reranking is keeping irrelevant documents, or dropping relevant ones

Diagnosis

The cross-encoder model was trained on a different task or domain than your data. Or query-document pairs are fundamentally mismatched.

Fix

Try a different cross-encoder whose training task is closer to yours (e.g., `cross-encoder/qnli-distilroberta-base`, trained on QNLI to judge whether a passage answers a question). Or fine-tune the cross-encoder on your domain's labeled data.

Production upgrade path

Production version: (1) Batch rerank across multiple queries to amortize per-call overhead and keep the GPU busy. (2) Cache cross-encoder scores keyed by a hash of the query and document together, so identical pairs are never re-scored. (3) Monitor score distribution per query type (e.g., factual vs. open-ended) and use adaptive thresholds. (4) Implement fallback: if <2 documents survive reranking, retry with a lower threshold (0.3) rather than returning no results. (5) Log reranker scores alongside final LLM output for evaluation and debugging.
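
Items (2) and (4) can be sketched together. `CachedReranker` is a hypothetical class, not part of any library; it wraps any callable that scores (query, document) pairs, and `toy_scorer` stands in for the cross-encoder so the example runs without a model. The cache here is an in-process dict; a shared store would be needed across workers.

```python
import hashlib

class CachedReranker:
    """Caches pair scores and falls back to a lower threshold when too few documents survive."""

    def __init__(self, score_pairs, threshold=0.5, fallback_threshold=0.3, min_docs=2):
        self.score_pairs = score_pairs
        self.threshold = threshold
        self.fallback_threshold = fallback_threshold
        self.min_docs = min_docs
        self._cache = {}

    @staticmethod
    def _key(query, doc):
        # Hash query and document together; a query-only key would mix up documents.
        return hashlib.sha256(f"{query}\x00{doc}".encode("utf-8")).hexdigest()

    def rerank(self, query, docs):
        # Score only the pairs that are not cached yet.
        missing = [d for d in docs if self._key(query, d) not in self._cache]
        if missing:
            new_scores = self.score_pairs([(query, d) for d in missing])
            for d, s in zip(missing, new_scores):
                self._cache[self._key(query, d)] = float(s)

        ranked = sorted(
            ((d, self._cache[self._key(query, d)]) for d in docs),
            key=lambda p: p[1],
            reverse=True,
        )
        kept = [(d, s) for d, s in ranked if s >= self.threshold]
        if len(kept) < self.min_docs:
            # Fallback: relax the threshold instead of returning too few results.
            kept = [(d, s) for d, s in ranked if s >= self.fallback_threshold]
        return kept

def toy_scorer(pairs):
    """Stand-in scorer with fixed, hypothetical scores per document."""
    table = {"doc a": 0.9, "doc b": 0.4, "doc c": 0.1}
    return [table[doc] for _, doc in pairs]

reranker = CachedReranker(toy_scorer)
print(reranker.rerank("some query", ["doc a", "doc b", "doc c"]))
```

Only "doc a" clears 0.5 here, so the fallback threshold of 0.3 applies and "doc b" is kept as well. A second call with the same query and documents is served entirely from the cache.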

Common gotcha

A cross-encoder score of 0.6 does NOT mean 60% relevance: it's a learned scalar that depends on the model's training data and task definition. MiniLM trained on MS MARCO search rankings will calibrate differently than a model trained on legal document relevance. Do NOT compare scores across different cross-encoder models or apply a threshold from one model to another. Always validate your chosen threshold on a small labeled test set before production rollout.
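
One concrete reason thresholds do not transfer: models can differ in output range. Some checkpoints return values already squashed into 0-1, while others return unbounded logits. The sketch below uses hypothetical logit values to show how a sigmoid maps logits into the 0-1 range without changing their order. It does not make scores from different models comparable.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function: maps a real-valued logit into the 0-1 range."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

logits = [8.6, 0.0, -4.3]  # hypothetical raw model outputs
probs = [sigmoid(v) for v in logits]

for raw, p in zip(logits, probs):
    print(f"logit={raw:+.1f} -> sigmoid={p:.4f}")

# A 0.5 cut-off on the sigmoid output is the same rule as a 0.0 cut-off on the logit.
assert sigmoid(0.0) == 0.5
```

If your observed scores fall outside 0-1, a threshold such as 0.5 means something entirely different from what this guide's examples assume, so inspect the range before tuning.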

Experienced dev note

Most tutorials show reranking as a standalone step, but in production you'll want to A/B test it. Reranking helps with precision (pushing the relevant docs to the top and dropping noise) but costs latency; it cannot improve recall, because it only sees what the retriever returned. On small retrievals (5–10 docs), the embedding retriever is often good enough, and reranking adds cost with minimal gain. On large retrievals (50+ docs), reranking is essential. Also: if your embedding model is already high-quality (e.g., a domain-specific fine-tuned model), reranking gains are smaller. Measure on YOUR data before committing infrastructure.
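
One way to run that measurement offline is to compare a ranking metric with and without the reranker on a labeled query set. The sketch below uses mean reciprocal rank (MRR) as one possible metric; the document IDs, rankings, and relevance labels are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries: rankings maps query id -> ordered doc ids, relevant maps query id -> set of ids."""
    total = 0.0
    for query_id, ranked_docs in rankings.items():
        for position, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / position  # only the first relevant hit counts
                break
    return total / len(rankings)

# Hypothetical labels and rankings for two queries.
relevant = {"q1": {"d1"}, "q2": {"d5"}}
retriever_only = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
with_reranker = {"q1": ["d1", "d3", "d2"], "q2": ["d5", "d4"]}

baseline = mean_reciprocal_rank(retriever_only, relevant)
reranked = mean_reciprocal_rank(with_reranker, relevant)
print(f"MRR retriever only: {baseline:.2f}")
print(f"MRR with reranker:  {reranked:.2f}")
```

Compare the gain against the added latency from your own measurements before deciding whether the reranker earns its place in the pipeline.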

Check your understanding

You retrieve 20 documents from your vector store. Your cross-encoder reranker scores them and you set a threshold of 0.7. Only 2 documents exceed 0.7. Is this a problem? Why or why not, and what would you do?

Show answer hint

Not necessarily a problem. The threshold may be working as intended if your retriever returned many low-relevance candidates. The real question: are those 2 documents actually relevant? And is the question answerable from just 2 documents? If yes, ship it. If no, either lower the threshold (to 0.5), switch to top-k (keep the best 5–10), or improve your retriever.
