How to improve RAG latency
Quick answer
To improve RAG latency, optimize your document retrieval by using efficient vector stores like FAISS or Chroma with approximate nearest neighbor search, and reduce LLM call overhead by batching queries or using smaller, faster models such as gpt-4o-mini. Additionally, cache frequent retrieval results and precompute embeddings to minimize runtime delays.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0" faiss-cpu chromadb
Set up an efficient vector store
Use a fast vector database like FAISS or Chroma to speed up document retrieval. Precompute and store embeddings to avoid repeated computation during queries.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
import os
# Initialize embeddings and vector store
# Note: use an embedding model here; gpt-4o-mini is a chat model, not an embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=os.environ["OPENAI_API_KEY"])
# Example: Load documents and create FAISS index
texts = ["Document 1 text", "Document 2 text"]
vector_store = FAISS.from_texts(texts, embeddings)
# Save index for reuse
vector_store.save_local("faiss_index")
Step-by-step RAG query with caching
Combine fast retrieval with caching to reduce repeated LLM calls. Use a smaller model for faster generation and batch queries when possible.
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Simulated cache dictionary
cache = {}
# Function to perform RAG query
# 1. Retrieve relevant docs from vector store
# 2. Check cache for prompt
# 3. Call LLM if cache miss
def rag_query(query):
    if query in cache:
        return cache[query]
    # Retrieve top docs (simulate retrieval)
    retrieved_docs = ["Document 1 text"]  # Replace with actual retrieval
    prompt = f"Context: {retrieved_docs[0]}\nQuestion: {query}\nAnswer:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    cache[query] = answer
    return answer
# Example usage
print(rag_query("What is Document 1 about?"))
Output
(LLM-generated answer summarizing Document 1; exact text varies between runs)
Common variations to reduce latency
- Use approximate nearest neighbor search (e.g., HNSW in FAISS) to speed retrieval.
- Precompute embeddings offline and load at runtime.
- Use smaller or distilled LLMs like gpt-4o-mini for faster inference.
- Batch multiple queries to reduce API overhead.
- Implement asynchronous calls to overlap retrieval and generation.
Troubleshooting latency issues
If retrieval is slow, verify your vector store index is properly built and uses approximate search. If LLM calls are slow, check network latency and consider switching to a smaller model. For cache misses causing delays, increase cache hit rate by caching more queries or pre-warming common prompts.
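Pre-warming can be sketched with the same plain dictionary cache used in the query function above. Here the LLM call is stubbed out so the pattern is visible without an API key; the query list and stub answer are illustrative assumptions:

```python
# Minimal cache pre-warming sketch; answer_query is a stub
# standing in for the real rag_query LLM call.
cache = {}

def answer_query(query):
    return f"stub answer for: {query}"  # stand-in for the LLM call

def prewarm(queries):
    """Populate the cache ahead of time so the first real requests are hits."""
    for q in queries:
        if q not in cache:
            cache[q] = answer_query(q)

common_queries = ["What is Document 1 about?", "Summarize Document 2"]
prewarm(common_queries)
print(len(cache))  # 2
```

Running pre-warming at startup (or on a schedule) converts predictable first-request latency into a background cost.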
Key Takeaways
- Use efficient vector stores with approximate nearest neighbor search to speed document retrieval.
- Precompute and cache embeddings and LLM responses to minimize repeated computation.
- Choose smaller, faster LLM models like gpt-4o-mini for latency-sensitive RAG applications.
- Batch and asynchronously process queries to reduce API call overhead.
- Monitor and optimize both retrieval and generation steps separately for best latency gains.