Best reranking models for RAG
Use text-embedding-3-large or claude-3-5-sonnet-20241022 for reranking: the former provides high-quality semantic embeddings, the latter strong contextual understanding. Both excel at ranking relevant documents to improve downstream generation accuracy.

Recommendation

Choose text-embedding-3-large from OpenAI for the best balance of semantic vector quality and cost efficiency, or claude-3-5-sonnet-20241022 from Anthropic for contextual reranking via chat-style relevance judgments.

| Use case | Best choice | Why | Runner-up |
|---|---|---|---|
| Semantic document reranking | text-embedding-3-large | High-dimensional embeddings with strong semantic accuracy and fast vector search compatibility | claude-3-5-sonnet-20241022 |
| Contextual reranking with chat | claude-3-5-sonnet-20241022 | Chat-based reranking enables nuanced understanding of query-document relevance | gpt-4o |
| Low-latency reranking | text-embedding-3-small | Smaller embedding size for faster inference and lower cost with reasonable accuracy | gpt-4o-mini |
| Multilingual reranking | text-embedding-3-large | Supports multiple languages with robust semantic embeddings | claude-3-5-sonnet-20241022 |
| Open-source/local reranking | Use sentence-transformers/all-MiniLM-L6-v2 | Free, local embeddings with good semantic quality for reranking without API costs | N/A |
Top picks explained
For semantic reranking in RAG, text-embedding-3-large from OpenAI is the top choice due to its high-dimensional embeddings (3072 dimensions by default, reducible via the `dimensions` parameter) that capture fine-grained semantic relationships, enabling precise document ranking. It integrates seamlessly with vector databases for efficient retrieval.
claude-3-5-sonnet-20241022 from Anthropic excels when reranking requires deeper contextual understanding, leveraging chat-based interactions to assess relevance beyond simple vector similarity. This is ideal for complex queries needing nuanced interpretation.
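One way to implement chat-based reranking is to show the model the query and a numbered list of candidates and ask for an ordering. The sketch below uses the Anthropic SDK; the prompt wording and the `parse_ranking` helper are illustrative assumptions, not an official API pattern.

```python
import os
import re

def parse_ranking(reply: str) -> list[int]:
    """Extract document indices from a comma-separated model reply like '2, 0, 1'."""
    return [int(n) for n in re.findall(r"\d+", reply)]

def rerank_with_claude(query: str, documents: list[str]) -> list[str]:
    """Ask Claude to order documents by relevance to the query (sketch)."""
    # Imported here so parse_ranking can be used without the SDK installed.
    import anthropic
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    numbered = "\n".join(f"{i}. {doc}" for i, doc in enumerate(documents))
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": (
                f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
                "Rank the documents from most to least relevant to the query. "
                "Reply with only the document numbers, comma-separated."
            ),
        }],
    )
    order = parse_ranking(message.content[0].text)
    return [documents[i] for i in order if 0 <= i < len(documents)]
```

Parsing the reply defensively (the index filter in the last line) matters because chat models occasionally return malformed lists.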
For cost-sensitive or low-latency scenarios, text-embedding-3-small offers a good balance of speed and accuracy. Open-source models like sentence-transformers/all-MiniLM-L6-v2 provide a free alternative for local reranking without API dependencies.
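For the local option, a minimal sketch with the sentence-transformers library looks like this (assumes the package is installed; the model weights download on first use):

```python
def rank_by_score(documents, scores):
    """Pair documents with scores and sort most relevant first."""
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)

def rerank_local(query, documents,
                 model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Rerank documents by cosine similarity using a local bi-encoder."""
    # Imported here so the pure helper above works without the package installed.
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer(model_name)
    query_emb = model.encode(query, convert_to_tensor=True)
    doc_embs = model.encode(documents, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0].tolist()
    return rank_by_score(documents, scores)
```

Note that all-MiniLM-L6-v2 is a bi-encoder, so this scores query and documents independently; for higher accuracy at higher latency, a local cross-encoder reranker can score each query-document pair jointly.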
In practice
```python
import os
import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example: rerank documents by embedding similarity
query = "Explain quantum computing"
documents = [
    "Quantum computing uses quantum bits.",
    "Classical computers use bits.",
    "Quantum entanglement is a key resource.",
]

# Get the query embedding
query_vector = client.embeddings.create(
    model="text-embedding-3-large",
    input=query,
).data[0].embedding

# Get document embeddings in a single batched call
doc_resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=documents,
)
doc_embeddings = [d.embedding for d in doc_resp.data]

# Cosine similarity between two vectors
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank documents by similarity to the query, most relevant first
ranked_docs = sorted(
    zip(documents, doc_embeddings),
    key=lambda x: cosine_similarity(query_vector, x[1]),
    reverse=True,
)
for doc, _ in ranked_docs:
    print(doc)
```

Expected ordering: the two quantum-related documents rank above the classical-computing one.
Pricing and limits
| Option | Free | Cost | Limits | Context |
|---|---|---|---|---|
| text-embedding-3-large | No free tier | $0.13 / 1M tokens | Max 8192 tokens input | High-quality semantic embeddings, 3072 dims (default) |
| text-embedding-3-small | No free tier | $0.02 / 1M tokens | Max 8192 tokens input | Faster, smaller embeddings, 1536 dims |
| claude-3-5-sonnet-20241022 | No free tier | $3 / 1M input tokens, $15 / 1M output | 200K-token context window | Chat-based reranking with deep context |
| sentence-transformers/all-MiniLM-L6-v2 | Free, open-source | Free | Limited by local hardware; 384 dims | Good semantic embeddings for local use |
What to avoid
- Avoid relying on smaller embedding models like `text-embedding-3-small` alone for high-accuracy reranking in complex RAG workflows; their lower-dimensional embeddings trade away semantic granularity.
- Do not rely on generic chat models like `gpt-4o-mini` for reranking; they are optimized for generation, not relevance scoring.
- Avoid deprecated or low-dimension embeddings that reduce retrieval quality and increase false positives.
- Steer clear of local-only models without fine-tuning if your use case demands domain-specific reranking accuracy.
How to evaluate for your case
Benchmark reranking models by measuring retrieval precision and recall on your domain-specific dataset. Use metrics like Mean Reciprocal Rank (MRR) or Normalized Discounted Cumulative Gain (NDCG). Implement an evaluation script that reranks candidate documents for queries and compares against ground truth relevance labels.
Test latency and cost trade-offs by timing embedding generation and reranking steps. Adjust model choice based on accuracy, speed, and budget constraints.
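The evaluation loop can be sketched with plain Python; the per-query rankings and relevance labels below are illustrative placeholders, not real benchmark data:

```python
import math

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with graded relevance labels (dict: doc_id -> gain)."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative: ranking produced by the reranker, plus ground-truth gains
runs = {
    "q1": (["d3", "d1", "d2"], {"d1": 2, "d2": 1}),
    "q2": (["d5", "d4"], {"d5": 3}),
}
mrr = sum(reciprocal_rank(r, set(rel)) for r, rel in runs.values()) / len(runs)
print(f"MRR: {mrr:.3f}")  # first relevant at rank 2, then rank 1 -> 0.750
```

Swap the placeholder `runs` dict for rankings produced by each candidate model on your labeled queries, and compare the resulting MRR/NDCG scores alongside measured latency and cost.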
Key Takeaways
- Use `text-embedding-3-large` for the best semantic reranking quality in RAG.
- Leverage `claude-3-5-sonnet-20241022` for chat-based contextual reranking.
- Open-source embeddings like `sentence-transformers/all-MiniLM-L6-v2` enable free local reranking.
- Avoid generic chat models and low-dimension embeddings for reranking tasks.
- Evaluate reranking models with domain-specific benchmarks and latency tests.