Fix slow reranking in RAG pipeline
Quick answer
Fix slow reranking in a RAG pipeline by batching multiple query-document pairs into a single reranker call and caching reranking results to avoid redundant computation. Use efficient models like gpt-4o-mini or claude-sonnet-4-5 and minimize API calls by reranking only the top candidates.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable.
- Run pip install openai
- Set export OPENAI_API_KEY='your_api_key' on Linux/macOS or setx OPENAI_API_KEY "your_api_key" on Windows
Step by step
This example demonstrates batching multiple query-document pairs for reranking using the OpenAI SDK with gpt-4o-mini. It caches reranking results to speed up repeated queries.
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents and queries
queries = ["What is RAG?", "Explain reranking in NLP."]
documents = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Reranking improves retrieval results by ordering candidates.",
    "NLP pipelines often use reranking for better accuracy.",
]

# Cache dictionary to store reranking results
rerank_cache = {}

# Pack every (query, document) pair into ONE prompt. A messages list is a
# single conversation, so sending one message per pair would not return one
# score per pair -- the pairs must share a prompt to be scored in one call.
pairs = [(q, d) for q in queries for d in documents]
numbered = "\n".join(
    f"{i + 1}. Query: {q} | Document: {d}" for i, (q, d) in enumerate(pairs)
)
prompt = (
    "Score the relevance of each document to its query from 0 to 1.\n"
    "Reply with one score per line, in order, nothing else.\n\n" + numbered
)

# Call the reranker model once for all pairs
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# Parse one score per line and cache each result
score_lines = response.choices[0].message.content.strip().splitlines()
for (q, d), text in zip(pairs, score_lines):
    try:
        score = float(text.strip())
    except ValueError:
        score = 0.0  # fall back to zero on unparseable output
    rerank_cache[(q, d)] = score

# Display cached reranking scores
for query in queries:
    print(f"Reranking scores for query: '{query}'")
    for doc in documents:
        print(f"  Document: '{doc}' -> Score: {rerank_cache[(query, doc)]:.2f}")
```

Output
```
Reranking scores for query: 'What is RAG?'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.95
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.60
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.55
Reranking scores for query: 'Explain reranking in NLP.'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.40
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.90
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.85
```
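The cache above is filled in one pass; for repeated queries across calls, the lookup-before-compute pattern can be sketched as below. The score_fn parameter is a hypothetical stand-in for the real API call, injected so the pattern can be run and tested without network access:

```python
def make_cached_reranker(score_fn):
    # score_fn(query, doc) -> float stands in for the real reranker call
    cache = {}

    def rerank(query, doc):
        key = (query, doc)
        if key not in cache:  # only pay for a score on a cache miss
            cache[key] = score_fn(query, doc)
        return cache[key]

    return rerank

# Count how often the underlying scorer actually runs
calls = []

def dummy_score(query, doc):
    calls.append((query, doc))
    return 0.5

rerank = make_cached_reranker(dummy_score)
rerank("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
rerank("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
print(len(calls))  # prints 1: the second call was served from the cache
```

In production, score_fn would wrap the API call shown earlier; the closure-held dict could also be swapped for a persistent store such as Redis without changing callers.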
Common variations
To further optimize reranking speed:
- Use async calls with asyncio and the OpenAI SDK's AsyncOpenAI client.
- Choose smaller models like gpt-4o-mini or claude-sonnet-4-5 for faster inference.
- Implement streaming if supported to process partial results early.
- Limit reranking to top-k retrieved documents instead of all candidates.
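The top-k idea in the last bullet can be sketched with a cheap lexical-overlap shortlist run before the expensive reranker; overlap_score and top_k_candidates are illustrative helpers written for this example, not part of the OpenAI SDK:

```python
def overlap_score(query, doc):
    # Cheap heuristic: fraction of query words that appear in the document
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def top_k_candidates(query, documents, k=2):
    # Shortlist the k most promising documents; only these go to the reranker
    return sorted(documents, key=lambda d: overlap_score(query, d), reverse=True)[:k]

docs = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Reranking improves retrieval results by ordering candidates.",
    "NLP pipelines often use reranking for better accuracy.",
]
print(top_k_candidates("what is RAG", docs, k=2))  # the 2 most promising docs
```

In a real pipeline, the shortlist would typically come from your vector store's similarity scores rather than word overlap; the point is that the reranker sees k documents instead of all of them.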
The async variation below uses the SDK's AsyncOpenAI client (the v1 SDK has no acreate method) and fires one request per query-document pair concurrently:

```python
import asyncio
import os

from openai import AsyncOpenAI

async def async_rerank():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    queries = ["What is RAG?", "Explain reranking in NLP."]
    documents = [
        "RAG stands for Retrieval-Augmented Generation.",
        "Reranking improves retrieval results by ordering candidates.",
        "NLP pipelines often use reranking for better accuracy.",
    ]
    pairs = [(q, d) for q in queries for d in documents]

    async def score_pair(query, doc):
        prompt = (
            "Score the relevance of this document to the query from 0 to 1.\n"
            f"Query: {query}\nDocument: {doc}\nReply with the score only:"
        )
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,  # a score like "0.95" needs more than one token
        )
        return response.choices[0].message.content.strip()

    # Launch all requests at once instead of awaiting each in turn
    scores = await asyncio.gather(*(score_pair(q, d) for q, d in pairs))
    for i, score in enumerate(scores):
        print(f"Score {i}: {score}")

asyncio.run(async_rerank())
```

Output
```
Score 0: 0.95
Score 1: 0.60
Score 2: 0.55
Score 3: 0.40
Score 4: 0.90
Score 5: 0.85
```
Troubleshooting
If reranking is still slow:
- Check network latency and retry with a closer API region.
- Verify you are batching multiple queries instead of calling the API per document.
- Ensure caching is implemented to avoid duplicate reranking calls.
- Monitor API rate limits and increase concurrency carefully.
- Use lighter models or reduce max_tokens to speed up inference.
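For the rate-limit and concurrency point, a semaphore is a common way to cap how many requests are in flight at once; this sketch substitutes a dummy fake_score coroutine for the real API call so the pattern runs locally:

```python
import asyncio

MAX_CONCURRENT = 3  # tune this against your account's rate limits

async def bounded_rerank(pairs, score_coro):
    # The semaphore caps how many score requests run at the same time
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(query, doc):
        async with sem:
            return await score_coro(query, doc)

    # gather preserves input order, so scores line up with pairs
    return await asyncio.gather(*(one(q, d) for q, d in pairs))

# Dummy stand-in for the API call so the pattern can be exercised offline
async def fake_score(query, doc):
    await asyncio.sleep(0.01)
    return 0.5

pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d1"), ("q2", "d2")]
scores = asyncio.run(bounded_rerank(pairs, fake_score))
print(scores)  # one score per pair, in the original order
```

To use it for real, pass a coroutine that calls the AsyncOpenAI client in place of fake_score, and raise MAX_CONCURRENT gradually while watching for 429 responses.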
Key Takeaways
- Batch multiple query-document pairs into a single API call to reduce latency.
- Cache reranking results to avoid redundant computations in RAG pipelines.
- Use smaller, efficient models like gpt-4o-mini or claude-sonnet-4-5 for faster reranking.
- Limit reranking to top-k candidates to minimize API usage and speed up processing.
- Implement async calls and streaming when supported to improve throughput.