How to reduce RAG costs
Quick answer

To reduce RAG costs, optimize your embedding model usage by batching and caching embeddings, limit document size with smart chunking, and minimize calls to the LLM by filtering relevant documents before generation. Use cheaper embedding models, and leverage vector stores with efficient indexing to reduce expensive API calls.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python SDK and set your API key as an environment variable to interact with OpenAI's embedding and chat models.
```shell
pip install "openai>=1.0"
```

Step by step
This example demonstrates reducing RAG costs by caching embeddings, chunking documents, and filtering relevant chunks before calling the LLM.
```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Cache for embeddings to avoid repeated API calls for the same text
embedding_cache = {}

def get_embedding(text):
    if text in embedding_cache:
        return embedding_cache[text]
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    embedding = response.data[0].embedding
    embedding_cache[text] = embedding
    return embedding

# Simple chunking function (word-based; max_tokens is approximated by word count)
def chunk_text(text, max_tokens=500):
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_tokens):
        chunks.append(" ".join(words[i:i + max_tokens]))
    return chunks

# Example documents
documents = [
    "OpenAI develops advanced AI models.",
    "RAG combines retrieval with generation to improve accuracy.",
    "Embedding models convert text to vectors for similarity search.",
]

# Precompute embeddings for all chunks
chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]

# Similarity filter: OpenAI embeddings are unit-normalized, so the dot
# product equals cosine similarity
def is_relevant(query_embedding, chunk_embedding, threshold=0.8):
    dot = sum(q * c for q, c in zip(query_embedding, chunk_embedding))
    return dot > threshold

query = "How does RAG reduce costs?"
query_embedding = get_embedding(query)

# Filter relevant chunks before calling the LLM
relevant_chunks = [
    chunk
    for chunk, emb in zip(chunks, chunk_embeddings)
    if is_relevant(query_embedding, emb)
]

# Prepare a prompt containing only the relevant chunks
prompt = "Use the following context to answer the question:\n"
prompt += "\n---\n".join(relevant_chunks)
prompt += f"\nQuestion: {query}\nAnswer:"

# Call the LLM once with the filtered context
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Output
OpenAI's Retrieval-Augmented Generation (RAG) reduces costs by limiting the amount of text the model processes, focusing only on relevant information retrieved via embeddings, which lowers token usage and API calls.
Common variations
You can reduce costs further by:
- Using cheaper embedding models like text-embedding-3-small instead of larger ones.
- Batching embedding requests to reduce overhead.
- Implementing async calls for parallel embedding generation.
- Using vector databases like FAISS or Chroma for efficient similarity search instead of naive filtering.
- Adjusting chunk size to balance context quality and token cost.
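The batching variation can be sketched in a few lines. The embeddings endpoint accepts a list of inputs and returns one embedding per input, in order, so one call can replace dozens. The helper names and the batch size of 100 below are assumptions, not API requirements:

```python
def make_batches(texts, batch_size=100):
    """Group texts into consecutive batches so each API call embeds many inputs."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

def get_embeddings_batched(client, texts, model="text-embedding-3-small", batch_size=100):
    """One embeddings call per batch instead of one per text."""
    embeddings = []
    for batch in make_batches(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # response.data preserves input order
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```

For 1,000 chunks this turns 1,000 requests into 10, cutting per-request overhead and latency.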
Troubleshooting
If you see high costs or latency:
- Check if embeddings are being recomputed unnecessarily; implement caching.
- Verify chunk sizes are not too large, causing excessive token usage.
- Ensure your similarity threshold filters out irrelevant chunks effectively.
- Monitor API usage and optimize model selection for embeddings and generation.
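One common cause of a mistuned threshold is comparing unnormalized vectors with a raw dot product. OpenAI embeddings are unit-length, so the dot product in the example works, but if you swap in other vectors an explicit cosine similarity is safer. A minimal, dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, robust to scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([2, 0], [1, 0]))  # → 1.0 (scale does not matter)
```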
Key Takeaways
- Cache embeddings to avoid repeated costly API calls.
- Chunk documents smartly to limit token usage per query.
- Filter retrieved documents before calling the LLM to reduce generation costs.
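The chunking takeaway can be extended with overlap, so a sentence split at a chunk boundary still appears intact in at least one chunk. A word-based sketch in the style of the chunk_text function above; the overlap size is an assumption to tune for your corpus:

```python
def chunk_with_overlap(text, max_words=500, overlap=50):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + max_words]))
        if i + max_words >= len(words):
            break
    return chunks
```

Larger overlaps improve retrieval quality at boundaries but raise token cost, since overlapping words are embedded (and potentially sent to the LLM) more than once.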