How to improve RAG latency
Quick answer
To improve RAG latency, optimize your document retrieval by using efficient vector stores like FAISS or Chroma with approximate nearest neighbor search, and reduce LLM call overhead by batching queries or using smaller, faster models such as gpt-4o-mini. Additionally, cache frequent retrieval results and precompute embeddings to minimize runtime delays.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" faiss-cpu chromadb
Set up an efficient vector store
Use a fast vector database like FAISS or Chroma to speed up document retrieval. Precompute and store embeddings to avoid repeated computation during queries.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os
# Initialize embeddings and vector store
# Note: use an embedding model here; gpt-4o-mini is a chat model, not an embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])
# Example: Load documents and create FAISS index
texts = ["Document 1 text", "Document 2 text"]
vector_store = FAISS.from_texts(texts, embeddings)
# Save index for reuse
vector_store.save_local("faiss_index")
Step-by-step RAG query with caching
Combine fast retrieval with caching to reduce repeated LLM calls. Use a smaller model for faster generation and batch queries when possible.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Simulated cache dictionary
cache = {}
# Function to perform RAG query
# 1. Retrieve relevant docs from vector store
# 2. Check cache for prompt
# 3. Call LLM if cache miss
def rag_query(query):
    if query in cache:
        return cache[query]
    # Retrieve top docs (simulate retrieval)
    retrieved_docs = ["Document 1 text"]  # Replace with actual retrieval
    prompt = f"Context: {retrieved_docs[0]}\nQuestion: {query}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    cache[query] = answer
    return answer
# Example usage
print(rag_query("What is Document 1 about?"))
Output
(LLM-generated answer summarizing Document 1; exact text varies between runs)
Common variations to reduce latency
- Use approximate nearest neighbor search (e.g., HNSW in FAISS) to speed retrieval.
- Precompute embeddings offline and load at runtime.
- Use smaller or distilled LLMs like gpt-4o-mini for faster inference.
- Batch multiple queries to reduce API overhead.
- Implement asynchronous calls to overlap retrieval and generation.
Troubleshooting latency issues
If retrieval is slow, verify your vector store index is properly built and uses approximate search. If LLM calls are slow, check network latency and consider switching to a smaller model. For cache misses causing delays, increase cache hit rate by caching more queries or pre-warming common prompts.
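Pre-warming can be sketched with the same plain dictionary cache used in the query function above. Here the LLM call is stubbed out so the pattern is visible without an API key; the query list and stub answer are illustrative assumptions:

```python
# Minimal cache pre-warming sketch; answer_query is a stub
# standing in for the real rag_query LLM call.
cache = {}

def answer_query(query):
    return f"stub answer for: {query}"  # stand-in for the LLM call

def prewarm(queries):
    """Populate the cache ahead of time so the first real requests are hits."""
    for q in queries:
        if q not in cache:
            cache[q] = answer_query(q)

common_queries = ["What is Document 1 about?", "Summarize Document 2"]
prewarm(common_queries)
print(len(cache))  # 2
```

Running pre-warming at startup (or on a schedule) converts predictable first-request latency into a background cost.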
Key Takeaways
- Use efficient vector stores with approximate nearest neighbor search to speed document retrieval.
- Precompute and cache embeddings and LLM responses to minimize repeated computation.
- Choose smaller, faster LLM models like gpt-4o-mini for latency-sensitive RAG applications.
- Batch and asynchronously process queries to reduce API call overhead.
- Monitor and optimize both retrieval and generation steps separately for best latency gains.