How-to · Intermediate · 3 min read

Fix slow reranking in RAG pipeline

Quick answer
Fix slow reranking in a RAG pipeline by batching multiple query-document pairs into a single reranker call and caching results to avoid redundant computation. Use efficient models such as gpt-4o-mini or claude-sonnet-4-5, and minimize API calls by reranking only the top retrieved candidates.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python SDK and set your API key as an environment variable.

  • Run pip install openai
  • Set export OPENAI_API_KEY='your_api_key' on Linux/macOS, or setx OPENAI_API_KEY "your_api_key" on Windows (setx only takes effect in newly opened terminals)
bash
pip install openai
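Before making any calls, it is worth confirming the key is actually visible to Python. A minimal check (the helper name `api_key_configured` is just for illustration):

```python
import os

def api_key_configured() -> bool:
    # True when OPENAI_API_KEY is present and non-empty in the environment.
    return bool(os.environ.get("OPENAI_API_KEY"))

print("API key configured:", api_key_configured())
```

If this prints False, the SDK will raise an authentication error on the first request, so fix the environment variable first.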

Step by step

This example demonstrates batching all query-document pairs into a single reranker call using the OpenAI SDK with gpt-4o-mini. It caches the parsed scores so repeated queries skip the API entirely.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents and queries
queries = ["What is RAG?", "Explain reranking in NLP."]
documents = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Reranking improves retrieval results by ordering candidates.",
    "NLP pipelines often use reranking for better accuracy."
]

# Cache dictionary to store reranking results
rerank_cache = {}

# Build one prompt that contains every query-document pair. The Chat
# Completions API returns a single choice per request, so all pairs
# must share one prompt rather than being sent as separate messages.
pairs = [(query, doc) for query in queries for doc in documents]
numbered = "\n".join(
    f"{i + 1}. Query: {query} | Document: {doc}"
    for i, (query, doc) in enumerate(pairs)
)
prompt = (
    "Score the relevance of each document to its query on a scale from "
    "0 to 1. Reply with one number per line, in order, nothing else.\n"
    + numbered
)

# Call the reranker model once for the whole batch
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0
)

# Parse one score per line and cache each result
score_lines = response.choices[0].message.content.strip().splitlines()
for (query, doc), line in zip(pairs, score_lines):
    try:
        score = float(line.strip())
    except ValueError:
        score = 0.0
    rerank_cache[(query, doc)] = score

# Display cached reranking scores
for query in queries:
    print(f"Reranking scores for query: '{query}'")
    for doc in documents:
        print(f"  Document: '{doc}' -> Score: {rerank_cache[(query, doc)]:.2f}")
output
Reranking scores for query: 'What is RAG?'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.95
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.60
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.55
Reranking scores for query: 'Explain reranking in NLP.'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.40
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.90
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.85

Common variations

To further optimize reranking speed:

  • Use async calls with asyncio and the SDK's AsyncOpenAI client to score pairs concurrently.
  • Choose smaller models like gpt-4o-mini or claude-sonnet-4-5 for faster inference.
  • Implement streaming if supported to process partial results early.
  • Limit reranking to top-k retrieved documents instead of all candidates.
python
import asyncio
import os
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def score_pair(query: str, doc: str) -> float:
    prompt = (
        "Score the relevance of this document to the query on a scale "
        f"from 0 to 1. Reply with the number only.\nQuery: {query}\nDocument: {doc}"
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0

async def async_rerank():
    queries = ["What is RAG?", "Explain reranking in NLP."]
    documents = [
        "RAG stands for Retrieval-Augmented Generation.",
        "Reranking improves retrieval results by ordering candidates.",
        "NLP pipelines often use reranking for better accuracy."
    ]
    pairs = [(q, d) for q in queries for d in documents]
    # Fire all requests concurrently; gather preserves input order
    scores = await asyncio.gather(*(score_pair(q, d) for q, d in pairs))
    for i, score in enumerate(scores):
        print(f"Score {i}: {score}")

asyncio.run(async_rerank())
output
Score 0: 0.95
Score 1: 0.60
Score 2: 0.55
Score 3: 0.40
Score 4: 0.90
Score 5: 0.85
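The top-k variation needs no API call at all: keep only the highest-scoring candidates from the first-stage retriever and send just those to the reranker. A minimal sketch using heapq.nlargest (the retrieval scores here are illustrative):

```python
import heapq

# First-stage retrieval scores (e.g., BM25 or embedding similarity);
# the values below are illustrative.
candidates = [
    ("RAG stands for Retrieval-Augmented Generation.", 0.82),
    ("Reranking improves retrieval results by ordering candidates.", 0.41),
    ("NLP pipelines often use reranking for better accuracy.", 0.37),
    ("Unrelated document about cooking.", 0.05),
]

def top_k(candidates, k):
    # Keep only the k best candidates so the reranker sees fewer pairs.
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

shortlist = top_k(candidates, k=2)
for doc, score in shortlist:
    print(f"{score:.2f}  {doc}")
```

Reranking 2 documents instead of 4 halves the pair count here; in a real pipeline with hundreds of retrieved candidates the savings are far larger.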

Troubleshooting

If reranking is still slow:

  • Check network latency and retry with a closer API region.
  • Verify you are batching multiple queries instead of calling the API per document.
  • Ensure caching is implemented to avoid duplicate reranking calls.
  • Monitor API rate limits and increase concurrency carefully.
  • Use lighter models or reduce max_tokens to speed up inference.
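The caching point above can be verified without touching the API by wrapping any scorer in a memoizing helper. In this sketch, `expensive_score` is an illustrative stand-in for a real reranker call, and the counter shows that repeated pairs never trigger a second call:

```python
# Memoize scores by (query, document) so repeated pairs never
# trigger a second reranker call.
cache = {}
call_count = 0

def expensive_score(query, doc):
    # Stand-in for a real reranker API call; counts invocations.
    global call_count
    call_count += 1
    return 1.0 if query.split()[-1].rstrip("?") in doc else 0.0

def cached_score(query, doc):
    key = (query, doc)
    if key not in cache:
        cache[key] = expensive_score(query, doc)
    return cache[key]

cached_score("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
cached_score("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
print("API calls made:", call_count)  # second lookup hits the cache
```

If the counter climbs on repeated queries, the cache key is wrong (for example, it omits the document or includes a timestamp).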

Key Takeaways

  • Batch multiple query-document pairs into a single API call to reduce latency.
  • Cache reranking results to avoid redundant computations in RAG pipelines.
  • Use smaller, efficient models like gpt-4o-mini or claude-sonnet-4-5 for faster reranking.
  • Limit reranking to top-k candidates to minimize API usage and speed up processing.
  • Implement async calls and streaming when supported to improve throughput.
Verified 2026-04 · gpt-4o-mini, claude-sonnet-4-5