Code Advanced hard · 8 min

When to pre-compute vs compute at query time

What you will learn

Decide whether to build vector indices and metadata upfront or generate embeddings and filters dynamically during query execution.

Why this matters

This choice directly affects query latency, infrastructure cost, and data freshness. Getting it wrong either adds avoidable latency to every query (seconds, on a large corpus) or forces unnecessary re-indexing on every document change.

Skip if: Your documents change more than once per hour, your embedding model updates frequently, or your infrastructure cannot afford storage for multiple indices. In these cases, skip pre-computation and accept the query cost of runtime computation (roughly 500ms-2s).

Explanation

The core tension: Pre-computation means building indices and generating embeddings during data ingestion (once), storing them, then running fast retrieval. Query-time computation means keeping raw data only and generating embeddings/filters on every query.

How it works mechanically: With pre-computation, you call VectorStoreIndex.from_documents() during data load, persist the index to disk or a vector DB, then reuse it for all subsequent queries; only the query string is embedded at request time. With query-time computation, you keep only the raw documents and call VectorStoreIndex.from_documents() inside the query path, so every document is re-embedded before index.as_retriever().retrieve(query) runs. Metadata filters can be applied at retrieval time in either approach.
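The difference in work per request can be sketched without any external service. The example below is a toy, not LlamaIndex code: CountingEmbedder and top_match are hypothetical stand-ins, and the "embedding" is just a normalised letter-frequency vector. It only counts how many embedding calls each approach makes.

```python
import math


class CountingEmbedder:
    """Hypothetical stand-in for an embedding model that counts its own calls."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, text: str) -> list[float]:
        self.calls += 1
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        return [c / norm for c in counts]  # unit-length letter-frequency vector


def top_match(query_vec, doc_vecs, docs):
    """Return the document whose vector has the highest dot product with the query."""
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    return docs[scores.index(max(scores))]


docs = [
    "Paris is the capital of France",
    "Rust has a borrow checker",
    "FAISS stores vectors",
]
query = "capital of France"
requests = 5

# Pre-compute: embed documents once at ingestion, then only the query per request.
pre = CountingEmbedder()
stored_vecs = [pre.embed(d) for d in docs]
ingest_calls = pre.calls
for _ in range(requests):
    top_match(pre.embed(query), stored_vecs, docs)
precomputed_calls = pre.calls - ingest_calls

# Query time: embed every document again on every request, plus the query.
rt = CountingEmbedder()
for _ in range(requests):
    fresh_vecs = [rt.embed(d) for d in docs]
    top_match(rt.embed(query), fresh_vecs, docs)
query_time_calls = rt.calls

print(ingest_calls, precomputed_calls, query_time_calls)  # 3 5 20
```

With a real embedding API each of those calls is a network round trip (or a batch of them), which is where the latency gap comes from.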

When to choose each: Pre-compute for static or slowly changing datasets (news archives, documentation, training corpora) where you can afford 5-30 minute indexing windows. Compute at query time for real-time data feeds, frequently updated knowledge bases, or when the embedding model is actively being fine-tuned and you need queries to pick up model improvements immediately.
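As a rough sketch, the criteria above can be written down as a small decision function. The Workload fields and the choose_strategy name are hypothetical, and the once-per-hour threshold is simply the rule of thumb from the "Skip if" note, not a hard limit.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    updates_per_hour: float   # how often documents change
    embed_model_stable: bool  # False while the model is being fine-tuned or swapped
    can_store_index: bool     # True if storage for the index is affordable


def choose_strategy(w: Workload) -> str:
    """Return 'precompute' or 'query_time' using the lesson's rules of thumb."""
    if not w.can_store_index:
        return "query_time"
    if not w.embed_model_stable:
        return "query_time"
    if w.updates_per_hour > 1:  # assumption: more than one change per hour
        return "query_time"
    return "precompute"


# A documentation site that changes a few times a day, with a stable model.
print(choose_strategy(Workload(updates_per_hour=0.1, embed_model_stable=True, can_store_index=True)))
```

In practice the thresholds depend on corpus size and latency budget, so treat the function as a checklist rather than a rule.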

Analogy

Pre-computation is like having a printed restaurant menu at every table: fast lookup, but expensive to reprint when specials change daily. Query-time computation is like a chef writing specials on a whiteboard each morning: slower to read but always fresh.

Code

Illustrative only: not runnable without a valid OpenAI API key and a ./sample_docs directory containing documents

python

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4.1")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# APPROACH 1: PRE-COMPUTE — Build index once, reuse many times
print("=== PRE-COMPUTE APPROACH ===")
docs = SimpleDirectoryReader("./sample_docs").load_data()

# Build vector index once during data ingestion
index_precomputed = VectorStoreIndex.from_documents(docs, show_progress=True)

# Persist to disk
index_precomputed.storage_context.persist("./precomputed_index")
print(f"Index persisted. Size: {len(docs)} documents.")

# Later: Load and query (no re-embedding)
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="./precomputed_index")
index_loaded = load_index_from_storage(storage_context)

query = "What is the capital of France?"
response_precomputed = index_loaded.as_retriever(similarity_top_k=3).retrieve(query)
print(f"Pre-computed query result (embedding already exists): {response_precomputed[0].node.get_content()[:100] if response_precomputed else 'No results'}")


# APPROACH 2: COMPUTE AT QUERY TIME — Embed on demand
print("\n=== QUERY-TIME COMPUTE APPROACH ===")

# Store documents WITHOUT building index
raw_docs = SimpleDirectoryReader("./sample_docs").load_data()
print(f"Documents loaded (not indexed yet): {len(raw_docs)} docs")

# Build index at query time (embedding happens now, not during load)
index_runtime = VectorStoreIndex.from_documents(raw_docs, show_progress=True)
response_runtime = index_runtime.as_retriever(similarity_top_k=3).retrieve(query)
print(f"Query-time computed result: {response_runtime[0].node.get_content()[:100] if response_runtime else 'No results'}")


# APPROACH 3: HYBRID — Pre-compute but with runtime filtering
print("\n=== HYBRID APPROACH ===")

# Reuse the embeddings that were pre-computed and persisted above (no re-embedding)
index_hybrid = index_loaded

# At query time: use pre-computed embeddings + apply filters
retriever = index_hybrid.as_retriever(
    similarity_top_k=5,
    filters=None  # Could add dynamic metadata filters here
)
response_hybrid = retriever.retrieve(query)
print(f"Hybrid query (pre-computed vectors + runtime filter): {len(response_hybrid)} results found")


# PERFORMANCE MEASUREMENT
print("\n=== TIMING COMPARISON ===")
import time

# The pre-computed index is already loaded: each call embeds only the query, then searches
start = time.perf_counter()
for _ in range(10):
    index_loaded.as_retriever(similarity_top_k=3).retrieve(query)
precomputed_time = (time.perf_counter() - start) / 10
print(f"Pre-computed retrieval: {precomputed_time*1000:.2f}ms per query (embedding cached)")

# Query-time: re-embed every document, build the index, then retrieve
start = time.perf_counter()
for _ in range(10):
    VectorStoreIndex.from_documents(raw_docs).as_retriever(similarity_top_k=3).retrieve(query)
runtime_time = (time.perf_counter() - start) / 10
print(f"Query-time compute: {runtime_time*1000:.2f}ms per query (embedding generated)")

print(f"\nSpeedup factor: {runtime_time / precomputed_time:.1f}x faster with pre-computation")

Output

=== PRE-COMPUTE APPROACH ===
Index persisted. Size: N documents.
Pre-computed query result (embedding already exists): [result text]

=== QUERY-TIME COMPUTE APPROACH ===
Documents loaded (not indexed yet): N docs
Query-time computed result: [result text]

=== HYBRID APPROACH ===
Hybrid query (pre-computed vectors + runtime filter): X results found

=== TIMING COMPARISON ===
Pre-computed retrieval: 15.45ms per query (embedding cached)
Query-time compute: 285.30ms per query (embedding generated)

Speedup factor: 18.5x faster with pre-computation

What just happened?

The code demonstrated three strategies: (1) building and persisting an index during data load, then reusing it (about 15ms per query in this sample run); (2) rebuilding the index at query time, which regenerates every document embedding on each request (about 285ms); (3) a hybrid where embeddings are pre-computed but metadata filtering happens at runtime. In this run pre-computation is roughly 18x faster, and the gap widens as the corpus grows because query-time cost scales with the number of documents. The advantage only holds while the persisted index stays in sync with the underlying data.

Common gotcha

Developers often pre-compute indices for fast retrieval, then add new documents but forget to re-index, so queries return stale results: the new documents are simply not in the persisted index. The index does not update itself; you must explicitly add the new documents (or rebuild) and re-persist. Similarly, if you switch to a different embedding model, the pre-computed vectors are not comparable with the new model's query vectors and must all be regenerated; there is no backward compatibility.
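One way to catch both problems is to store a fingerprint next to the persisted index and compare it before serving queries. This is a minimal sketch using only the standard library; the manifest.json file name and the three helper functions are hypothetical conventions, not part of LlamaIndex.

```python
import hashlib
import json
from pathlib import Path


def corpus_fingerprint(doc_dir: str, embed_model_name: str) -> str:
    """Hash the embedding model name plus every file path and file content."""
    h = hashlib.sha256()
    h.update(embed_model_name.encode())
    root = Path(doc_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()


def write_manifest(persist_dir: str, fingerprint: str) -> None:
    """Record the fingerprint the index was built from (call right after persisting)."""
    target = Path(persist_dir)
    target.mkdir(parents=True, exist_ok=True)
    (target / "manifest.json").write_text(json.dumps({"fingerprint": fingerprint}))


def index_is_stale(persist_dir: str, doc_dir: str, embed_model_name: str) -> bool:
    """True if there is no manifest, or the documents or model changed since the build."""
    manifest = Path(persist_dir) / "manifest.json"
    if not manifest.exists():
        return True
    saved = json.loads(manifest.read_text())["fingerprint"]
    return saved != corpus_fingerprint(doc_dir, embed_model_name)
```

Hashing every file is fine for small corpora; for large ones, modification times or a database change log would be a cheaper signal. Keep the persist directory outside the document directory so the manifest does not change the fingerprint.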

Error recovery

FileNotFoundError on load_index_from_storage

You persisted to a directory but it doesn't exist or you changed the path. Verify the persist_dir path matches exactly where you saved it. Check with os.path.exists(persist_dir) before loading.
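A defensive pattern is to fall back to a rebuild when the directory is missing or empty. The sketch below is library-agnostic: load_or_build is a hypothetical helper, and the load and build callables are whatever you already use (for LlamaIndex, load would wrap StorageContext.from_defaults plus load_index_from_storage, and build would wrap from_documents plus persist).

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(
    persist_dir: str,
    load: Callable[[str], T],
    build: Callable[[str], T],
) -> T:
    """Load a persisted index if the directory has content, otherwise build it."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(persist_dir)
    # Missing or empty directory: build (and persist) instead of raising.
    return build(persist_dir)
```

Resolving persist_dir to an absolute path once, at startup, also avoids the common case where the working directory differs between the ingestion job and the query service.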

Vector dimension mismatch

You pre-computed with text-embedding-3-small (1536 dims) but later changed Settings.embed_model to a different model with different dimensions. Re-index or revert the embedding model. There is no automatic remapping.
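A cheap guard is to embed a short probe string at startup and compare its length with the dimension the index was built with. This is a sketch under assumptions: check_embedding_dimension and DimensionMismatchError are hypothetical names, stored_dim is something you record yourself at build time, and embed is any callable that returns a vector (with LlamaIndex, Settings.embed_model.get_text_embedding would be a candidate).

```python
from typing import Callable, Sequence


class DimensionMismatchError(ValueError):
    """Raised when the current embedding model does not match the stored index."""


def check_embedding_dimension(
    stored_dim: int,
    embed: Callable[[str], Sequence[float]],
) -> int:
    """Embed a probe string and verify its length against the stored dimension."""
    probe_dim = len(embed("dimension probe"))
    if probe_dim != stored_dim:
        raise DimensionMismatchError(
            f"index was built with {stored_dim}-dim vectors, "
            f"current model produces {probe_dim}; re-index or revert the model"
        )
    return probe_dim
```

Note that matching dimensions do not prove the models are the same: two different models can share a dimension, so recording the model name alongside the index is still worthwhile.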

Memory explosion on query-time compute

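Building a full index for every query on a large document set can exhaust memory, and it multiplies embedding calls by the number of requests. Use a streaming or batched retrieval approach, or pre-compute the index and cache it in the serving process instead. Never call from_documents() inside a query handler for production data.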
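The caching half of that advice can be sketched with functools.lru_cache: the expensive build runs once per process, and every handler call reuses the result. The dictionary below is a stand-in for a real index, and get_index, handle_query, and build_stats are hypothetical names.

```python
from functools import lru_cache

build_stats = {"builds": 0}  # counts how many times the expensive build runs


@lru_cache(maxsize=1)
def get_index() -> dict:
    """Build (or load) the index once; later calls return the cached object."""
    build_stats["builds"] += 1
    # Stand-in for loading a persisted index or calling from_documents() once.
    return {"capital of france": "Paris"}


def handle_query(query: str) -> str:
    """Query handler: never builds the index itself, only asks the cache."""
    return get_index().get(query.strip().lower(), "No results")


for q in ["Capital of France", "capital of france ", "unknown"]:
    handle_query(q)
print(build_stats["builds"])  # 1
```

Calling get_index.cache_clear() after a re-index forces the next request to pick up the new data.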

Experienced dev note

The hidden cost of pre-computation is staleness. If you pre-compute on Monday and documents are updated Tuesday, your queries return outdated information until you re-index. For frequently changing data, accept the per-query latency penalty (a few hundred milliseconds to a couple of seconds, depending on corpus size) and compute at query time: it can be cheaper than the operational overhead of coordinating index rebuilds. Also, a persisted index is harder to inspect; when you need to debug why a document scored high, rebuilding at query time lets you examine the embeddings and similarity scores as they are produced. Pre-computation optimizes for speed; query-time computation optimizes for visibility and freshness.

Check your understanding

You have a dataset of 100k product listings updated every 6 hours with new inventory. Your embedding model is stable. Should you pre-compute or compute at query time, and what is the key risk you need to mitigate if you choose pre-computation?

Show answer hint

The correct choice involves recognizing that 6-hour update windows are infrequent enough to allow pre-computation (queries stay fast), but you must build a rebuild schedule that triggers every 6 hours automatically, otherwise users see stale inventory. Query-time computation avoids this scheduling complexity but costs latency on every query.
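The scheduling check in that answer is small enough to sketch directly. The 6-hour interval comes from the question; rebuild_due and next_rebuild are hypothetical helpers you would call from a cron job or task scheduler.

```python
from datetime import datetime, timedelta, timezone

REBUILD_INTERVAL = timedelta(hours=6)  # matches the inventory update cadence


def rebuild_due(
    last_built: datetime,
    now: datetime,
    interval: timedelta = REBUILD_INTERVAL,
) -> bool:
    """True once at least one full interval has passed since the last build."""
    return now - last_built >= interval


def next_rebuild(last_built: datetime, interval: timedelta = REBUILD_INTERVAL) -> datetime:
    """Time at which the next rebuild should start."""
    return last_built + interval


last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
print(rebuild_due(last, datetime(2024, 1, 1, 5, 59, tzinfo=timezone.utc)))  # False
print(rebuild_due(last, datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)))   # True
```

Triggering the rebuild from the inventory update itself, rather than from a clock, would remove the window in which the index lags behind the data.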

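VERSION In llama-index 0.10.0+, import VectorStoreIndex from llama_index.core and configure the embedding model globally through Settings. Earlier releases imported from llama_index directly and configured models through ServiceContext, which was deprecated in 0.10.0 and has since been removed; very old releases also exposed the class as GPTVectorStoreIndex.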

Once you've decided when to pre-compute, the next challenge is choosing where to store it: explore vector database vs. FAISS vs. in-memory trade-offs for retrieval speed and scalability.
