How-to · Intermediate · 3 min read

How to reduce RAG costs

Quick answer
To reduce RAG costs, optimize your embedding model usage by batching and caching embeddings, limit document size with smart chunking, and minimize calls to the LLM by filtering relevant documents before generation. Use cheaper embedding models and leverage vector stores with efficient indexing to reduce expensive API calls.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python SDK and set your API key as an environment variable to interact with OpenAI's embedding and chat models.

bash
pip install openai>=1.0

Step by step

This example demonstrates reducing RAG costs by caching embeddings, chunking documents, and filtering relevant chunks before calling the LLM.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Cache for embeddings to avoid repeated calls
embedding_cache = {}

def get_embedding(text):
    if text in embedding_cache:
        return embedding_cache[text]
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = response.data[0].embedding
    embedding_cache[text] = embedding
    return embedding

# Simple chunking function (splits on whitespace; word count is only a
# rough proxy for tokens -- use tiktoken if you need exact counts)

def chunk_text(text, max_words=500):
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_words):
        chunks.append(" ".join(words[i:i + max_words]))
    return chunks

# Example documents
documents = [
    "OpenAI develops advanced AI models.",
    "RAG combines retrieval with generation to improve accuracy.",
    "Embedding models convert text to vectors for similarity search."
]

# Precompute embeddings for chunks
chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))

chunk_embeddings = [get_embedding(chunk) for chunk in chunks]

# Relevance filter: OpenAI embeddings are unit-normalized, so the dot
# product equals cosine similarity. A threshold of 0.8 is usually too
# strict for short queries; 0.4 is a more forgiving starting point.
def is_relevant(query_embedding, chunk_embedding, threshold=0.4):
    similarity = sum(q * c for q, c in zip(query_embedding, chunk_embedding))
    return similarity > threshold

query = "How does RAG reduce costs?"
query_embedding = get_embedding(query)

# Filter relevant chunks
relevant_chunks = [chunk for chunk, emb in zip(chunks, chunk_embeddings) if is_relevant(query_embedding, emb)]

# Prepare prompt with relevant chunks
prompt = "Use the following context to answer the question:\n"
prompt += "\n---\n".join(relevant_chunks)
prompt += f"\nQuestion: {query}\nAnswer:"

# Call LLM once with filtered context
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
output
Retrieval-Augmented Generation (RAG) reduces costs by limiting the amount of text the model processes, focusing only on relevant information retrieved via embeddings, which lowers token usage and API calls.
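
To sanity-check how much filtering actually saves, you can estimate prompt tokens before and after. The sketch below uses a rough words-to-tokens heuristic (about 1.3 tokens per word of English; use tiktoken for exact counts) and an illustrative per-token price — check OpenAI's current pricing page, as the figure here is an assumption.

```python
def estimate_tokens(text):
    # Rough heuristic: English averages ~1.3 tokens per word.
    return int(len(text.split()) * 1.3)

def estimate_prompt_cost(chunks, price_per_million_tokens=2.50):
    # price_per_million_tokens is an illustrative input price,
    # not a quoted rate -- verify against current pricing.
    tokens = sum(estimate_tokens(c) for c in chunks)
    return tokens, tokens / 1_000_000 * price_per_million_tokens

all_chunks = ["chunk one " * 100, "chunk two " * 100, "chunk three " * 100]
filtered = all_chunks[:1]  # pretend the relevance filter kept one chunk

full_tokens, full_cost = estimate_prompt_cost(all_chunks)
filt_tokens, filt_cost = estimate_prompt_cost(filtered)
print(f"unfiltered: ~{full_tokens} tokens, filtered: ~{filt_tokens} tokens")
```

Comparing the two totals makes the savings from filtering concrete before you spend anything on a generation call.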

Common variations

You can reduce costs further by:

  • Using cheaper embedding models like text-embedding-3-small instead of larger ones.
  • Batching embedding requests to reduce overhead.
  • Implementing async calls for parallel embedding generation.
  • Using vector databases like FAISS or Chroma for efficient similarity search instead of naive filtering.
  • Adjusting chunk size to balance context quality and token cost.
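
Before reaching for a vector database, a top-k selection often works better than a fixed threshold, because it always returns a bounded amount of context regardless of score distribution. A minimal pure-Python sketch with toy 3-d vectors standing in for real embeddings (a production system would use FAISS or Chroma for this):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, chunks, k=2):
    # Rank chunks by similarity to the query and keep the k best.
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:k]]

vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
texts = ["about RAG", "also about RAG", "unrelated"]
print(top_k([1.0, 0.0, 0.0], vecs, texts, k=2))
# → ['about RAG', 'also about RAG']
```

Because `k` caps the number of chunks sent to the LLM, it also caps the prompt-token cost of each generation call.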

Troubleshooting

If you see high costs or latency:

  • Check if embeddings are being recomputed unnecessarily; implement caching.
  • Verify chunk sizes are not too large, causing excessive token usage.
  • Ensure your similarity threshold filters out irrelevant chunks effectively.
  • Monitor API usage and optimize model selection for embeddings and generation.
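
The in-memory cache from the example above disappears when the process exits, so a restart re-embeds everything. A small disk-backed cache avoids that; the sketch below keys entries by a content hash and takes an `embed_fn` parameter as a stand-in for any real embedding call (the file path is an assumption, not a convention):

```python
import hashlib
import json
import os

CACHE_PATH = "embedding_cache.json"  # assumed location

def load_cache(path=CACHE_PATH):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_cache(cache, path=CACHE_PATH):
    with open(path, "w") as f:
        json.dump(cache, f)

def cached_embedding(text, embed_fn, cache):
    # Key by content hash so long documents don't bloat the key space.
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)  # only call the API on a miss
    return cache[key]

# Demo with a fake embedder that records how often it is called
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cache = {}
cached_embedding("hello", fake_embed, cache)
cached_embedding("hello", fake_embed, cache)
print(len(calls))  # → 1
```

Call `save_cache(cache)` at shutdown and `load_cache()` at startup so repeated runs over the same corpus cost nothing to re-embed.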

Key Takeaways

  • Cache embeddings to avoid repeated costly API calls.
  • Chunk documents smartly to limit token usage per query.
  • Filter retrieved documents before calling the LLM to reduce generation costs.
Verified 2026-04 · gpt-4o, text-embedding-3-small