Fix slow reranking in RAG pipeline
Quick answer
Fix slow reranking in a RAG pipeline by batching multiple query-document pairs into a single reranker call and caching reranking results to avoid redundant computation. Use efficient models like gpt-4o-mini or claude-sonnet-4-5 and minimize API calls by reranking only the top candidates.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python SDK and set your API key as an environment variable.
- Run pip install openai
- Set export OPENAI_API_KEY='your_api_key' on Linux/macOS or setx OPENAI_API_KEY "your_api_key" on Windows
Step by step
This example demonstrates batching multiple query-document pairs for reranking using the OpenAI SDK with gpt-4o-mini. It caches reranking results to speed up repeated queries.
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample documents and queries
queries = ["What is RAG?", "Explain reranking in NLP."]
documents = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Reranking improves retrieval results by ordering candidates.",
    "NLP pipelines often use reranking for better accuracy.",
]

# Cache dictionary to store reranking results
rerank_cache = {}

# Pack every (query, document) pair into ONE prompt. A messages list is a
# single conversation, so sending one message per pair would not return one
# score per pair -- the pairs must share a prompt to be scored in one call.
pairs = [(q, d) for q in queries for d in documents]
numbered = "\n".join(
    f"{i + 1}. Query: {q} | Document: {d}" for i, (q, d) in enumerate(pairs)
)
prompt = (
    "Score the relevance of each document to its query from 0 to 1.\n"
    "Reply with one score per line, in order, nothing else.\n\n" + numbered
)

# Call the reranker model once for all pairs
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# Parse one score per line and cache each result
score_lines = response.choices[0].message.content.strip().splitlines()
for (q, d), text in zip(pairs, score_lines):
    try:
        score = float(text.strip())
    except ValueError:
        score = 0.0  # fall back to zero on unparseable output
    rerank_cache[(q, d)] = score

# Display cached reranking scores
for query in queries:
    print(f"Reranking scores for query: '{query}'")
    for doc in documents:
        print(f"  Document: '{doc}' -> Score: {rerank_cache[(query, doc)]:.2f}")
```

Output
```
Reranking scores for query: 'What is RAG?'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.95
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.60
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.55
Reranking scores for query: 'Explain reranking in NLP.'
  Document: 'RAG stands for Retrieval-Augmented Generation.' -> Score: 0.40
  Document: 'Reranking improves retrieval results by ordering candidates.' -> Score: 0.90
  Document: 'NLP pipelines often use reranking for better accuracy.' -> Score: 0.85
```
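The cache above is filled in one pass; for repeated queries across calls, the lookup-before-compute pattern can be sketched as below. The score_fn parameter is a hypothetical stand-in for the real API call, injected so the pattern can be run and tested without network access:

```python
def make_cached_reranker(score_fn):
    # score_fn(query, doc) -> float stands in for the real reranker call
    cache = {}

    def rerank(query, doc):
        key = (query, doc)
        if key not in cache:  # only pay for a score on a cache miss
            cache[key] = score_fn(query, doc)
        return cache[key]

    return rerank

# Count how often the underlying scorer actually runs
calls = []

def dummy_score(query, doc):
    calls.append((query, doc))
    return 0.5

rerank = make_cached_reranker(dummy_score)
rerank("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
rerank("What is RAG?", "RAG stands for Retrieval-Augmented Generation.")
print(len(calls))  # prints 1: the second call was served from the cache
```

In production, score_fn would wrap the API call shown earlier; the closure-held dict could also be swapped for a persistent store such as Redis without changing callers.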
Common variations
To further optimize reranking speed:
- Use async calls with asyncio and the OpenAI SDK's AsyncOpenAI client.
- Choose smaller models like gpt-4o-mini or claude-sonnet-4-5 for faster inference.
- Implement streaming if supported to process partial results early.
- Limit reranking to top-k retrieved documents instead of all candidates.
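The top-k idea in the last bullet can be sketched with a cheap lexical-overlap shortlist run before the expensive reranker; overlap_score and top_k_candidates are illustrative helpers written for this example, not part of the OpenAI SDK:

```python
def overlap_score(query, doc):
    # Cheap heuristic: fraction of query words that appear in the document
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def top_k_candidates(query, documents, k=2):
    # Shortlist the k most promising documents; only these go to the reranker
    return sorted(documents, key=lambda d: overlap_score(query, d), reverse=True)[:k]

docs = [
    "RAG stands for Retrieval-Augmented Generation.",
    "Reranking improves retrieval results by ordering candidates.",
    "NLP pipelines often use reranking for better accuracy.",
]
print(top_k_candidates("what is RAG", docs, k=2))  # the 2 most promising docs
```

In a real pipeline, the shortlist would typically come from your vector store's similarity scores rather than word overlap; the point is that the reranker sees k documents instead of all of them.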
The async variation below uses the SDK's AsyncOpenAI client (the v1 SDK has no acreate method) and fires one request per query-document pair concurrently:

```python
import asyncio
import os

from openai import AsyncOpenAI

async def async_rerank():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    queries = ["What is RAG?", "Explain reranking in NLP."]
    documents = [
        "RAG stands for Retrieval-Augmented Generation.",
        "Reranking improves retrieval results by ordering candidates.",
        "NLP pipelines often use reranking for better accuracy.",
    ]
    pairs = [(q, d) for q in queries for d in documents]

    async def score_pair(query, doc):
        prompt = (
            "Score the relevance of this document to the query from 0 to 1.\n"
            f"Query: {query}\nDocument: {doc}\nReply with the score only:"
        )
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5,  # a score like "0.95" needs more than one token
        )
        return response.choices[0].message.content.strip()

    # Launch all requests at once instead of awaiting each in turn
    scores = await asyncio.gather(*(score_pair(q, d) for q, d in pairs))
    for i, score in enumerate(scores):
        print(f"Score {i}: {score}")

asyncio.run(async_rerank())
```

Output
```
Score 0: 0.95
Score 1: 0.60
Score 2: 0.55
Score 3: 0.40
Score 4: 0.90
Score 5: 0.85
```
Troubleshooting
If reranking is still slow:
- Check network latency and retry with a closer API region.
- Verify you are batching multiple queries instead of calling the API per document.
- Ensure caching is implemented to avoid duplicate reranking calls.
- Monitor API rate limits and increase concurrency carefully.
- Use lighter models or reduce max_tokens to speed up inference.
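For the rate-limit and concurrency point, a semaphore is a common way to cap how many requests are in flight at once; this sketch substitutes a dummy fake_score coroutine for the real API call so the pattern runs locally:

```python
import asyncio

MAX_CONCURRENT = 3  # tune this against your account's rate limits

async def bounded_rerank(pairs, score_coro):
    # The semaphore caps how many score requests run at the same time
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def one(query, doc):
        async with sem:
            return await score_coro(query, doc)

    # gather preserves input order, so scores line up with pairs
    return await asyncio.gather(*(one(q, d) for q, d in pairs))

# Dummy stand-in for the API call so the pattern can be exercised offline
async def fake_score(query, doc):
    await asyncio.sleep(0.01)
    return 0.5

pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d1"), ("q2", "d2")]
scores = asyncio.run(bounded_rerank(pairs, fake_score))
print(scores)  # one score per pair, in the original order
```

To use it for real, pass a coroutine that calls the AsyncOpenAI client in place of fake_score, and raise MAX_CONCURRENT gradually while watching for 429 responses.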
Key Takeaways
- Batch multiple query-document pairs into a single API call to reduce latency.
- Cache reranking results to avoid redundant computations in RAG pipelines.
- Use smaller, efficient models like gpt-4o-mini or claude-sonnet-4-5 for faster reranking.
- Limit reranking to top-k candidates to minimize API usage and speed up processing.
- Implement async calls and streaming when supported to improve throughput.