Workflow Advanced hard · 10 min decision_step

Cost per query optimization

What you will learn
Decide between retrieval strategies and reranking depth to minimize embedding API + LLM costs per query while maintaining relevance.
Step 4 in the RAG pipeline: After setting up retrieval (step 3) and before measuring relevance (step 5). This is a cost-vs-quality trade-off checkpoint.

Why this matters

Naive retrieval costs multiply: every query triggers an embedding for the query + reranking across the retrieved docs + LLM calls on the top-k results. At 10,000 queries/day, the estimates later in this lesson put the gap between the cheapest and the most expensive strategy at roughly $240/month ($60 vs. $300), and an oversized K or rerank depth widens it further. Skip this and your RAG system becomes prohibitively expensive to scale.

Explanation

The cost problem: Each query pays for: (1) query embedding, (2) vector search across indexed docs, (3) reranker scoring on top-K results, (4) LLM inference on reranked results. Standard RAG retrieves 20 docs and reranks all 20. At scale, that's expensive.

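Your levers: You control three variables: (1) initial retrieval K (how many docs to pull from the vector DB), (2) rerank depth (`top_n`: how many reranked docs survive and reach the LLM), (3) retrieval strategy (single query vs. multi-query vs. HyDE). Each lever trades cost for relevance. Note that in the LangChain setup used in this lesson, the reranker scores every doc the retriever returns, so K also sets the reranker's workload.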
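To make the levers concrete, here is a minimal sketch of a per-query cost model. Every price and token count in it is a placeholder assumption rather than a vendor quote, and `Prices` and `per_query_cost` are hypothetical names, so the totals will not match this lesson's rounded estimates. The point is only to show which lever feeds which cost term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Placeholder unit prices in USD (assumptions; substitute your vendor's rates)."""
    embed_per_1k_tokens: float = 0.00002  # embedding price per 1k tokens
    rerank_per_request: float = 0.002     # reranker price per rerank request
    llm_per_1k_tokens: float = 0.00015    # LLM input price per 1k tokens


def per_query_cost(
    prices: Prices,
    embed_calls: int,           # 1 for single-query, about 3 for multi-query
    tokens_per_embed: int,      # tokens in each embedded text
    top_n: int,                 # docs kept after reranking and sent to the LLM
    tokens_per_doc: int,        # average tokens per retrieved doc
    extra_llm_tokens: int = 0,  # tokens spent generating variants or a hypothesis
) -> float:
    """Sum the embedding, reranking, and LLM terms for one query."""
    embed = embed_calls * tokens_per_embed / 1000 * prices.embed_per_1k_tokens
    rerank = prices.rerank_per_request
    llm = (top_n * tokens_per_doc + extra_llm_tokens) / 1000 * prices.llm_per_1k_tokens
    return embed + rerank + llm


single = per_query_cost(Prices(), embed_calls=1, tokens_per_embed=20,
                        top_n=10, tokens_per_doc=300)
multi = per_query_cost(Prices(), embed_calls=3, tokens_per_embed=20,
                       top_n=10, tokens_per_doc=300, extra_llm_tokens=200)
print(f"single-query: ${single:.6f}  multi-query: ${multi:.6f}")
```

With these placeholder inputs the embedding term is the smallest of the three, which is a useful sanity check to repeat with your own prices before deciding which lever to tune first.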

What to watch: The cost isn't just the reranker call: a large K is paid for twice, once in vector DB compute and again when the reranker scores every candidate. Retrieving 100 docs to keep only 10 is wasted work unless the extra 90 measurably improve the final top 10. Also, multi-query and HyDE multiply your embedding costs by 3-5x and add an LLM call per query, so only use them if single-query retrieval is missing critical docs. Measure empirically: track cost-per-successful-query (queries that led to useful LLM responses), not cost-per-query.
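The cost-per-successful-query metric can be sketched as follows. The log record shape (`cost_usd`, `useful`) is an assumption made for illustration; adapt it to whatever your request logging already captures.

```python
from typing import Iterable, Mapping


def cost_per_successful_query(log: Iterable[Mapping]) -> float:
    """Total spend divided by the number of queries that produced a useful answer.

    Each record is assumed to carry `cost_usd` (float) and `useful` (bool).
    Returns infinity when nothing succeeded, so a broken strategy never looks cheap.
    """
    total = 0.0
    successes = 0
    for record in log:
        total += record["cost_usd"]
        if record["useful"]:
            successes += 1
    return total / successes if successes else float("inf")


# Hypothetical log: four queries at $0.0002 each, two of them useful.
sample_log = [
    {"cost_usd": 0.0002, "useful": True},
    {"cost_usd": 0.0002, "useful": False},
    {"cost_usd": 0.0002, "useful": True},
    {"cost_usd": 0.0002, "useful": False},
]
print(f"${cost_per_successful_query(sample_log):.4f} per successful query")
```

A strategy that is cheaper per query can still be more expensive per successful query if it fails more often, which is why this is the number to compare across strategies.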

Code

Illustrative only - not runnable without valid OpenAI and Cohere API keys
python
# pip install langchain langchain-community langchain-openai langchain-cohere chromadb

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import Chroma
from langchain.retrievers import MultiQueryRetriever, ContextualCompressionRetriever

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

vector_store = Chroma(
    collection_name="sample_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_data"
)

# top_n is how many docs the reranker returns; it still scores every doc it is given.
compressor = CohereRerank(model="rerank-english-v3.0", top_n=10)

print("\n=== STRATEGY 1: Single-query + shallow reranking ===")
retriever_1 = vector_store.as_retriever(search_kwargs={"k": 10})
compression_retriever_1 = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever_1
)
query = "What is the impact of inflation on stock valuations?"
docs_1 = compression_retriever_1.invoke(query)
print(f"Query: {query}")
print(f"Documents retrieved: {len(docs_1)}")
print(f"Estimated cost: ~$0.0002/query (1 embedding + 1 rerank)")

print("\n=== STRATEGY 2: Multi-query + medium reranking ===")
retriever_2 = vector_store.as_retriever(search_kwargs={"k": 20})
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever_2,
    llm=llm
)
compressor_2 = CohereRerank(model="rerank-english-v3.0", top_n=10)
compression_retriever_2 = ContextualCompressionRetriever(
    base_compressor=compressor_2,
    base_retriever=multi_query_retriever
)
docs_2 = compression_retriever_2.invoke(query)
print(f"Query: {query}")
print(f"Documents retrieved: {len(docs_2)}")
print(f"Estimated cost: ~$0.0006/query (3 embeddings + 1 rerank)")

print("\n=== STRATEGY 3: Cost analysis comparison ===")
strategies = {
    "Single-query (K=10, rerank=10)": {
        "embedding_calls": 1,
        "rerank_docs": 10,
        "cost_per_query": 0.0002,
        "latency_ms": 150
    },
    "Multi-query (K=20, rerank=10)": {
        "embedding_calls": 3,
        "rerank_docs": 10,
        "cost_per_query": 0.0006,
        "latency_ms": 350
    },
    "HyDE (K=30, rerank=5)": {
        "embedding_calls": 4,
        "rerank_docs": 5,
        "cost_per_query": 0.001,
        "latency_ms": 600
    }
}

for strategy_name, metrics in strategies.items():
    daily_cost_10k_queries = metrics["cost_per_query"] * 10000
    print(f"{strategy_name}:")
    print(f"  Cost/query: ${metrics['cost_per_query']:.4f}")
    print(f"  Daily cost (10k queries): ${daily_cost_10k_queries:.2f}")
    print(f"  Monthly cost: ${daily_cost_10k_queries * 30:.2f}")
    print(f"  Latency: {metrics['latency_ms']}ms\n")

print("\n=== DECISION TREE ===")
print("1. If monthly query volume > 50k AND cost is primary constraint:")
print("   → Choose Single-query strategy")
print("2. If monthly query volume 5k-50k AND relevance matters:")
print("   → Choose Multi-query strategy")
print("3. If monthly query volume < 5k AND correctness is critical:")
print("   → Choose HyDE strategy")
Output (the 0 documents retrieved suggests the `sample_docs` collection was empty for this run)
=== STRATEGY 1: Single-query + shallow reranking ===
Query: What is the impact of inflation on stock valuations?
Documents retrieved: 0
Estimated cost: ~$0.0002/query (1 embedding + 1 rerank)

=== STRATEGY 2: Multi-query + medium reranking ===
Query: What is the impact of inflation on stock valuations?
Documents retrieved: 0
Estimated cost: ~$0.0006/query (3 embeddings + 1 rerank)

=== STRATEGY 3: Cost analysis comparison ===
Single-query (K=10, rerank=10):
  Cost/query: $0.0002
  Daily cost (10k queries): $2.00
  Monthly cost: $60.00
  Latency: 150ms

Multi-query (K=20, rerank=10):
  Cost/query: $0.0006
  Daily cost (10k queries): $6.00
  Monthly cost: $180.00
  Latency: 350ms

HyDE (K=30, rerank=5):
  Cost/query: $0.0010
  Daily cost (10k queries): $10.00
  Monthly cost: $300.00
  Latency: 600ms

=== DECISION TREE ===
1. If monthly query volume > 50k AND cost is primary constraint:
   → Choose Single-query strategy
2. If monthly query volume 5k-50k AND relevance matters:
   → Choose Multi-query strategy
3. If monthly query volume < 5k AND correctness is critical:
   → Choose HyDE strategy

Your options

Multi-query retrieval with medium reranking (K=20, rerank top-10)

Medium query volume (1k-5k/day), ambiguous domain (medicine, law), acceptable cost increase for relevance improvement. Balances cost and coverage.

Pros

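Query expansion catches phrasing variations. Each query variant retrieves up to 20 docs, and the deduplicated union gives broader coverage; the reranker then keeps only the top 10 for the LLM (cost control on the LLM side). Works well for complex domains where a single query misses context.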

Cons

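About 3x embedding cost (LangChain's MultiQueryRetriever embeds 3 generated variants by default), plus one LLM call per query to generate them. The reranker still runs once, over the merged candidate set. Latency +100-200ms. Complexity in monitoring which queries benefit.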

from langchain.retrievers import ContextualCompressionRetriever, MultiQueryRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
vector_store = Chroma(collection_name="docs", embedding_function=embeddings)

# Each generated query variant retrieves up to k=20 docs; results are merged and deduplicated.
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
    llm=llm,  # the default prompt generates the query variants
)

compressor = CohereRerank(model="rerank-english-v3.0", top_n=10)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=multi_query_retriever
)

docs = compression_retriever.invoke("How do interest rate hikes affect bond values?")
print(f"Retrieved {len(docs)} docs with 3x embedding cost, reranked top-10")

HyDE (Hypothetical Document Embeddings) with targeted reranking (K=30, rerank top-5)

Low query volume (< 1k/day), expensive relevance failures (medical/legal), where cost per query is secondary to correctness. High precision needed.

Pros

Generates a synthetic document semantically aligned to the answer. Catches conceptually relevant docs that keyword-based retrieval misses. Best recall of the three options. Keeps only a small subset (top-5) after reranking to control LLM cost.

Cons

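Highest latency (the LLM generates the hypothetical doc before retrieval can start). Roughly 4x embedding cost in this lesson's estimate, since the embedded text is a generated passage rather than a short query, plus the LLM generation call itself. Keeping only 5 docs risks dropping relevant context. Requires careful prompt tuning. Cost ~$0.0008-0.0015/query.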

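from langchain.chains import HypotheticalDocumentEmbedder
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
base_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# LangChain exposes HyDE as an embeddings wrapper, not a retriever class:
# embed_query() has the LLM write a hypothetical answer and embeds that text,
# while embed_documents() passes through to the base embeddings.
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)

# The "docs" collection must already be indexed with the same base embedding model.
vector_store = Chroma(collection_name="docs", embedding_function=hyde_embeddings)
hyde_retriever = vector_store.as_retriever(search_kwargs={"k": 30})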

compressor = CohereRerank(model="rerank-english-v3.0", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hyde_retriever
)

docs = compression_retriever.invoke("Mechanism of action of ACE inhibitors in hypertension")
print(f"Retrieved {len(docs)} high-precision docs, cost ~$0.001/query")

Validation step

Measure cost-per-successful-query, not cost-per-query. Log: (1) whether the retrieved docs were actually used by the LLM to generate the final answer, (2) whether the answer was rated correct by user feedback or an evaluation metric. If fewer than 70% of queries produce useful top-k results, your K value is probably too low. If retrieval cost exceeds 3x your LLM inference cost (LLM inference is usually 20-40% of total RAG cost), your retrieval strategy is inefficient. To instrument this, inspect the docs returned by `retriever.invoke()`: `doc.metadata["relevance_score"]` (set by the reranker, if it provides one) should be above 0.5 for the top-k results.
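A minimal sketch of the relevance-score check, using plain dicts as stand-ins for `doc.metadata`. Counting a query as "useful" when its best score exceeds the threshold is an assumption for illustration; `query_is_useful` and `useful_rate` are hypothetical helper names.

```python
from typing import Mapping, Sequence


def query_is_useful(doc_metadata: Sequence[Mapping], min_score: float = 0.5) -> bool:
    """True if at least one returned doc has a relevance_score above min_score."""
    scores = [m["relevance_score"] for m in doc_metadata if "relevance_score" in m]
    return bool(scores) and max(scores) > min_score


def useful_rate(per_query_metadata: Sequence[Sequence[Mapping]],
                min_score: float = 0.5) -> float:
    """Fraction of queries whose results pass query_is_useful."""
    if not per_query_metadata:
        return 0.0
    useful = sum(query_is_useful(m, min_score) for m in per_query_metadata)
    return useful / len(per_query_metadata)


# Hypothetical batch: one metadata list per query.
batch = [
    [{"relevance_score": 0.9}, {"relevance_score": 0.3}],
    [{"relevance_score": 0.2}],
    [],                               # nothing retrieved
    [{"relevance_score": 0.51}],
]
rate = useful_rate(batch)
print(f"useful rate: {rate:.0%}")
if rate < 0.70:
    print("below the 70% guideline: consider raising K")
```

With LangChain documents you would pass `[doc.metadata for doc in docs]` for each query.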

At scale

Below 1k queries/day: cost optimization is premature. Above 50k queries/day with a single-query strategy: vector DB latency (not API cost) becomes the bottleneck. Multi-query retrieval multiplies your vector DB QPS by 3; a DB sized for 1,000 QPS now needs to handle 3,000 QPS. At 100k+ daily queries, embedding spend is worth tuning too: switching from text-embedding-3-large ($0.00013/1k tokens) to text-embedding-3-small ($0.00002/1k tokens) cuts the per-token price 6.5x (about 85%). The absolute saving depends on tokens per query, so compute it from logged token counts rather than assuming a fixed dollar figure. Also: the reranker scores every document it receives, so reranking 100 docs instead of 20 is 5x the reranker work regardless of top_n. Whether that is also 5x the bill depends on the provider: some charge per document, others per rerank request up to a document cap.
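The embedding-price comparison can be worked through explicitly. The sketch below uses the two per-token prices quoted above; the 20 tokens per embedded query and the 30-day month are assumptions, and `monthly_embedding_cost` is a hypothetical helper.

```python
def monthly_embedding_cost(
    queries_per_day: int,
    embeds_per_query: int,
    tokens_per_embed: int,
    price_per_1k_tokens: float,
    days: int = 30,
) -> float:
    """Monthly embedding spend in USD for query-side embeddings only."""
    tokens = queries_per_day * embeds_per_query * tokens_per_embed * days
    return tokens / 1000 * price_per_1k_tokens


# Multi-query at 100k queries/day, 3 embeddings per query, ~20 tokens each (assumed).
large = monthly_embedding_cost(100_000, 3, 20, 0.00013)
small = monthly_embedding_cost(100_000, 3, 20, 0.00002)
print(f"3-large: ${large:.2f}/month, 3-small: ${small:.2f}/month, "
      f"ratio {large / small:.1f}x")

# Vector DB load scales with the number of searches per user query.
user_qps = 1000
print(f"vector DB QPS under multi-query: {user_qps * 3}")
```

The ratio between the two models is fixed by their prices, but the absolute monthly figure scales linearly with tokens per query, so longer inputs (for example HyDE passages) change the picture quickly.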

Rollback plan

If you deploy a strategy and relevance drops (queries > 2 hops away from the correct answer, user click-through falls 20%+), immediately revert to Multi-query (the safe middle ground). If you're on Single-query and relevance is poor: measure how many queries were solvable with top-10 vs. how many needed top-20. If > 15% need top-20, switch to Multi-query. If you're on HyDE and latency is unacceptable (> 1 second), either fall back to Multi-query or keep HyDE with a smaller, faster LLM for hypothesis generation (gpt-4o-mini instead of gpt-4). If the reranker API fails (quota exceeded, timeout), have a fallback: serve the top-K results in vector-similarity order without reranking (cheaper) or use a lightweight local reranker (cross-encoder model).
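The reranker fallback can be sketched as a thin wrapper. The `rerank` callable is a hypothetical stand-in for whatever client you use; the wrapper only assumes it takes a query and a list of docs and returns them reordered.

```python
from typing import Callable, List, Sequence, TypeVar

Doc = TypeVar("Doc")


def rerank_with_fallback(
    query: str,
    docs: Sequence[Doc],
    rerank: Callable[[str, Sequence[Doc]], Sequence[Doc]],
    top_n: int,
) -> List[Doc]:
    """Return the reranked top_n, or the first top_n in retrieval order on failure."""
    try:
        return list(rerank(query, docs))[:top_n]
    except Exception:
        # Reranker unavailable (quota, timeout, ...): keep vector-similarity order.
        # In real code, catch the client's specific exception types and log the event.
        return list(docs)[:top_n]


def failing_reranker(query, docs):
    raise TimeoutError("rerank timed out")


# Docs arrive from the vector store already ordered by similarity.
print(rerank_with_fallback("q", ["doc-a", "doc-b", "doc-c"], failing_reranker, top_n=2))
```

Because the vector store already returns results in similarity order, the degraded path still serves a sensible ranking, just without the reranker's precision gain.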

Debug symptoms

Your RAG system works in dev (< 100 queries/day) but suddenly costs $800/month in production (10k queries/day). No code changed.

Diagnosis

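You deployed with K=50 and rerank_top_n=50 (overkill for dev testing). At 10k queries/day, this runs 10k reranker calls on 50 docs each = 500k doc-reranks/day, or about 15M/month. Reranker APIs bill by rerank request or by document scored, so the bill grows with query volume and, under per-document billing, with K.

Fix

Audit your retriever config. Set search_kwargs={"k": 10} and CohereRerank(top_n=10). For 10k queries/day: 10k * 10 = 100k doc-reranks/day (about 3M/month), a 5x reduction from K=50. Check your reranker invoice to see how much of that reduction reaches the bill. Measure impact on relevance; if it drops, increase only K (not top_n).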
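The volume arithmetic for this fix, kept explicit about days versus months. The 30-day month is an assumption and `docs_scored_per_month` is a hypothetical helper.

```python
def docs_scored_per_month(queries_per_day: int, k: int, days: int = 30) -> int:
    """Number of documents the reranker scores per month when it sees K docs per query."""
    return queries_per_day * k * days


before = docs_scored_per_month(10_000, k=50)  # K=50
after = docs_scored_per_month(10_000, k=10)   # K=10
print(f"before: {before:,}  after: {after:,}  reduction: {before // after}x")
```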

Multi-query retrieval worked great in pilot (500 queries), but causes 4-second latencies at 5k concurrent users. Reranker API starts timing out.

Diagnosis

Multi-query strategy makes 3 embedding calls per query. At 5000 concurrent users, you're hitting the embedding API with 15k concurrent requests. Rate limiting + queuing causes cascading timeout. Reranker queues up behind embedding requests.

Fix

Add query caching: if same query arrives within 60 seconds, reuse embeddings. Use async/batch embedding: `embeddings.embed_documents(batch_of_queries)` instead of serial. Or downgrade to single-query strategy at high concurrency (implement circuit breaker: if embedding API p95 latency > 500ms, switch to single-query mode).
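A minimal sketch of the 60-second embedding cache and the latency-based circuit breaker described above. `TTLEmbeddingCache` and `choose_mode` are hypothetical names, the `embed` callable stands in for your embedding client, and the cache is deliberately simple (single process, no eviction of stale entries).

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLEmbeddingCache:
    """Reuse a query's embedding if the same query was embedded within the TTL."""

    def __init__(
        self,
        embed: Callable[[str], List[float]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the TTL can be tested without sleeping
        self._store: Dict[str, Tuple[float, List[float]]] = {}

    def get(self, query: str) -> List[float]:
        now = self._clock()
        hit = self._store.get(query)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]              # fresh entry: no API call
        vector = self._embed(query)    # miss or expired: call the embedding API
        self._store[query] = (now, vector)
        return vector


def choose_mode(embedding_p95_ms: float, threshold_ms: float = 500.0) -> str:
    """Circuit breaker: drop to single-query while the embedding API is slow."""
    return "single-query" if embedding_p95_ms > threshold_ms else "multi-query"


cache = TTLEmbeddingCache(embed=lambda q: [float(len(q))])
cache.get("interest rates")
cache.get("interest rates")  # served from cache
print(choose_mode(embedding_p95_ms=620.0))
```

In a multi-worker deployment the cache would need to live in a shared store, and exact-string matching only helps when users really do repeat queries verbatim.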

HyDE strategy retrieves off-topic documents and the LLM generates wrong answers. Hypothesis documents contain irrelevant info.

Diagnosis

The LLM used for hypothesis generation (here gpt-4o-mini) hallucinates or generates overly broad hypothetical documents. These embed as far from the query as random documents, defeating the purpose. You're paying 4x embedding cost for noise.

Fix

Add a validation step: after hypothesis generation, embed the original query as well and only keep hypothetical docs within cosine distance 0.2 of the query embedding (treat 0.2 as a starting threshold to tune). Or: swap the hypothesis-generation LLM (for example a locally hosted Llama 2-7B or Mistral) and compare retrieval quality before committing. Prompt the LLM explicitly: 'Generate a realistic document excerpt (not a summary or overview) that answers this question: ...'
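The cosine-distance filter can be sketched in plain Python. The vectors here are toy two-dimensional stand-ins for real embeddings, and `filter_hypotheses` is a hypothetical helper; the 0.2 threshold is the one suggested above.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def filter_hypotheses(
    query_vec: Sequence[float],
    hypotheses: Sequence[Tuple[str, Sequence[float]]],
    max_distance: float = 0.2,
) -> List[str]:
    """Keep only hypothetical docs whose embedding stays close to the query's."""
    return [
        text
        for text, vec in hypotheses
        if cosine_distance(query_vec, vec) < max_distance
    ]


query_vec = [1.0, 0.0]
candidates = [("on-topic excerpt", [0.9, 0.1]), ("drifted excerpt", [0.0, 1.0])]
print(filter_hypotheses(query_vec, candidates))
```

If the filter rejects every hypothesis, falling back to embedding the original query keeps the request from returning nothing.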

Production upgrade path

Tutorial version: pick a strategy in code and hope it works. Production version: (1) instrument all three strategies in shadow mode (run all three, log results, but only use one for the user). (2) A/B test them: 10% of traffic to multi-query, 90% to single-query. Measure cost + correctness. (3) Implement dynamic strategy selection: if embedding API latency > 300ms, switch to single-query. If user's first query didn't have a good result (negative feedback), retry next query with multi-query. (4) Add a cost budget: set max cost per query ($0.0005), and if any strategy exceeds it, alert ops. (5) Cache strategy decision: store which strategy worked best for each domain/user type, reuse it for similar queries.
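Shadow mode from step (1) can be sketched as below. The strategy callables are hypothetical stand-ins for your three retrievers, and this version runs the shadows sequentially for clarity; in production you would run them off the request path so they add no user-facing latency.

```python
import time
from typing import Callable, Dict, List


def run_with_shadows(
    query: str,
    strategies: Dict[str, Callable[[str], list]],
    primary: str,
    log: List[dict],
) -> list:
    """Run every strategy, log each outcome, and return only the primary's docs."""
    if primary not in strategies:
        raise KeyError(f"unknown primary strategy: {primary}")
    served: list = []
    for name, retrieve in strategies.items():
        start = time.perf_counter()
        try:
            docs = retrieve(query)
            error = None
        except Exception as exc:
            if name == primary:
                raise                    # a primary failure must reach the caller
            docs, error = [], repr(exc)  # a shadow failure is only logged
        log.append({
            "strategy": name,
            "n_docs": len(docs),
            "latency_s": time.perf_counter() - start,
            "error": error,
            "served": name == primary,
        })
        if name == primary:
            served = docs
    return served


shadow_log: List[dict] = []
result = run_with_shadows(
    "How do rate hikes affect bonds?",
    {"single-query": lambda q: ["doc-1"], "multi-query": lambda q: ["doc-1", "doc-2"]},
    primary="single-query",
    log=shadow_log,
)
print(result, [entry["strategy"] for entry in shadow_log])
```

The log records are what later feed the A/B comparison and the per-query cost budget in steps (2) and (4).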

Common gotcha

Developers often forget that K (initial retrieval count) and rerank top_n are independent levers. They set K=20 and top_n=5, thinking they're paying for 5 docs total, but the vector DB still returns 20 and the reranker still scores all 20 before 15 are discarded. Worse: they measure only reranker cost and miss the vector DB cost. Also: embedding cost scales with input length. A 10-word query costs less to embed than a 100-word query, so HyDE, which embeds a long hypothetical document instead of the short query, can cost 2x more than expected. Track actual token counts: `embeddings.embed_documents([doc])` returns only the embedding, so also log `len(tiktoken.encoding_for_model("text-embedding-3-small").encode(doc))` to catch surprises.
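A sketch of the token-count logging, with the tokenizer injected so the example runs without extra dependencies. `whitespace_tokens` is a crude stand-in used only for illustration; in production you would pass a tiktoken-based counter instead. The price is the text-embedding-3-small figure quoted in this lesson.

```python
from typing import Callable, Iterable, Tuple


def embedding_cost_usd(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    price_per_1k_tokens: float,
) -> Tuple[int, float]:
    """Return (total tokens, estimated USD) for embedding the given texts."""
    total_tokens = sum(count_tokens(text) for text in texts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


def whitespace_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


short_query = "impact of inflation on stock valuations"
long_passage = " ".join(["word"] * 100)  # stands in for a generated hypothetical doc

q_tokens, q_cost = embedding_cost_usd([short_query], whitespace_tokens, 0.00002)
p_tokens, p_cost = embedding_cost_usd([long_passage], whitespace_tokens, 0.00002)
print(f"query: {q_tokens} tokens, passage: {p_tokens} tokens, "
      f"cost ratio {p_cost / q_cost:.1f}x")
```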

Experienced dev note

Most teams waste 60% of their RAG budget on retrieval decisions they never consciously made. The default LangChain setup (K=4, no reranking) is for tutorials. Production systems almost always need K >= 10 because vector similarity alone misses 20-30% of relevant docs (false negatives from embedding model limitations). The real tradeoff isn't single-query vs. multi-query: it's retrieval quality vs. latency. Multi-query adds 100-200ms but catches edge cases that would otherwise require users to rephrase. If your deployment has high user friction from poor results (rephrase count high), multi-query pays for itself. If your deployment is cost-sensitive and users rarely rephrase (well-indexed domain, clear queries), single-query wins. Also: reranking with Cohere is expensive relative to vector-only ranking, but it halves your false-positive rate. The cost of a reranker call is often less than the cost of a bad LLM answer that confuses the user. Track end-to-end success metric (correctness of final LLM answer), not retrieval metrics (precision@10).

Check your understanding

You have 50,000 queries/month in production. Your single-query strategy (K=10, rerank all 10) costs $45/month and achieves 78% correctness. Your team proposes switching to multi-query (K=20, rerank top-10) to improve to 85% correctness. What's the new monthly cost, and is the trade-off worth it?

Answer hint

Single-query: 1 embedding + 10 docs reranked per query. Multi-query: 3 embeddings, 1 LLM call to generate the variants, and a rerank over the merged candidates (more than 10 docs at K=20 per variant) that keeps the top 10. Calculate embedding cost (text-embedding-3-small ~$0.00002/1k tokens) + rerank cost (assume ~$0.0001 per doc scored) for 50k queries/month. Then ask: does 7 percentage points of correctness justify the cost increase? (Usually yes if that translates to fewer user support tickets or rephrase cycles.)
