Code Intermediate medium · 6 min

Compact mode: fitting context efficiently

What you will learn

Compact mode intelligently selects the most relevant nodes from your index to fit within token limits without losing answer quality.

Why this matters

LLMs have fixed context windows: compact mode ensures your queries retrieve only the most relevant information, reducing costs and latency while maintaining accuracy on retrieval-augmented answers.

Skip if: Do not use compact mode when you need exhaustive retrieval (compliance audits, legal discovery) or when your retrieved context is already smaller than your token budget. Also skip compact mode if you're using an LLM with a very large context window (e.g., Claude 200K) where token efficiency is not a constraint.

Explanation

What it is: Compact mode is a retrieval strategy in llama-index that ranks and filters retrieved nodes by relevance, stopping when adding another node would exceed your token limit. Instead of retrieving a fixed number of nodes, it packs nodes intelligently until the context window fills.

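How it works mechanically: Set similarity_top_k high so the retriever fetches a wide pool of candidates ranked by similarity score. A filtering step on the query engine (a node postprocessor) then walks that pool in rank order, adding nodes to the context and tracking cumulative tokens. The moment the next node would breach your token budget (max_tokens in this section), it stops, so you get variable-length context adapted to each query. There is no compact=True flag on the retriever itself: in llama-index, "compact" names a response mode (response_mode="compact", the default) that packs the surviving chunks into as few LLM prompts as fit the context window.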
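The rank-then-stop loop described above can be sketched in plain Python. This is a minimal illustration, not llama-index's implementation: pack_by_budget, rough_token_count, and the (text, score) tuples are hypothetical names made up for this example, and the whitespace word count only approximates a real tokenizer.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a retrieved node: (text, similarity score).
ScoredText = Tuple[str, float]


def rough_token_count(text: str) -> int:
    """Crude estimate: whitespace-separated words. Swap in a real tokenizer."""
    return len(text.split())


def pack_by_budget(
    candidates: List[ScoredText],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[ScoredText]:
    """Keep nodes in rank order until the next one would exceed max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    packed: List[ScoredText] = []
    used = 0
    for text, score in ranked:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first node that does not fit
        packed.append((text, score))
        used += cost
    return packed


candidates = [
    ("solar power cuts emissions", 0.89),             # 4 words
    ("wind farms create local jobs", 0.76),           # 5 words
    ("hydro plants need large rivers to run", 0.66),  # 7 words
    ("coal is cheap", 0.41),                          # 3 words
]
kept = pack_by_budget(candidates, max_tokens=10)
print([score for _, score in kept])  # prints [0.89, 0.76]
```

With a budget of 10, the first two nodes use 9 tokens and the third (7 more) would overflow, so packing stops there. Stopping at the first overflow, rather than skipping ahead to smaller nodes, keeps the result strictly in relevance order; that is a design choice of this sketch.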

When to use it: Use compact mode in production systems where you pay per token (OpenAI, Anthropic), when responses must be fast, or when you have strict context window constraints. It's especially valuable for search-heavy applications where irrelevant context hurts both cost and model reasoning quality.

Analogy

Think of compact mode like packing a suitcase for a flight with a weight limit. You don't pack all your clothes: you rank by importance (underwear, toothbrush, socks) and keep adding items until you hit the weight limit. You end up with exactly what you need, no extra baggage.

Code

Illustrative only - not runnable without a valid API key

python

import os

from llama_index.core import (
    PromptTemplate,
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Placeholder key; setdefault leaves a key already exported in your shell untouched.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")

Settings.llm = OpenAI(model="gpt-4.1", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Expects a ./data directory containing your source documents.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Prompt templates must be PromptTemplate objects, not plain strings.
qa_template = PromptTemplate(
    "Answer based only on context: {context_str}\n\nQuestion: {query_str}"
)

# Baseline: the top 10 candidates reach the LLM with no filtering.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    text_qa_template=qa_template,
)

# Compact: fetch 15 candidates, then drop any that score below 0.5.
# Postprocessors are attached to the query engine, not to the retriever.
compact_query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
    response_mode="compact",  # the default; packs chunks into as few prompts as fit
    text_qa_template=qa_template,
)

response = compact_query_engine.query("What are the main benefits of renewable energy?")
print(f"Answer: {response}")
print(f"Retrieved {len(response.source_nodes)} nodes")
for node in response.source_nodes:
    # Whitespace word count is only a rough stand-in for a real token count.
    print(f"  - Score: {node.score:.3f}, Tokens: ~{len(node.get_content().split())}")

Output

Answer: Renewable energy sources provide sustainable power generation, reduce carbon emissions, and offer long-term cost savings through decreased fuel dependency. Major benefits include environmental protection, energy independence, and technological job creation.
Retrieved 3 nodes
  - Score: 0.892, Tokens: ~145
  - Score: 0.764, Tokens: ~132
  - Score: 0.658, Tokens: ~118

What just happened?

The code created a vector index from documents, then configured retrieval to fetch the 15 most similar nodes. A SimilarityPostprocessor filtered out nodes below a 0.5 similarity threshold, keeping only the highest-quality matches. When the query ran, the engine fetched the 15 candidates, kept the 3 that cleared the cutoff (the others fell below it), and returned an answer synthesized from those compact results. The output shows how many nodes were actually used and their relevance scores.

Common gotcha

Developers often set similarity_top_k=100 expecting compact mode to magically reduce it, but similarity_top_k only sets the size of the candidate pool. The actual compaction happens only when you add a SimilarityPostprocessor or a token-aware postprocessor to the query engine. Without a postprocessor, all 100 nodes are still passed along. Compact mode requires explicit configuration of filtering logic.

Error recovery

ValueError: max_tokens must be positive

You set max_tokens to 0 or negative. Set it to your actual token budget, e.g., max_tokens=2000 for a 4K context window with 2K reserved for output.

No nodes returned (empty source_nodes)

Your similarity_cutoff is too high, filtering out all nodes. Lower it (e.g., 0.3 instead of 0.8) or remove the SimilarityPostprocessor entirely to test if retrieval itself works.
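To pick a cutoff instead of guessing, it can help to see how many nodes survive at several thresholds. The sketch below is plain Python over a list of scores; survivors_by_cutoff and the sample scores are hypothetical, and it assumes a node is kept when its score is at or above the cutoff.

```python
from typing import Dict, List


def survivors_by_cutoff(scores: List[float], cutoffs: List[float]) -> Dict[float, int]:
    """Count how many similarity scores meet or exceed each candidate cutoff."""
    return {cutoff: sum(score >= cutoff for score in scores) for cutoff in cutoffs}


# Hypothetical scores, as you might collect from node.score on an unfiltered query.
scores = [0.892, 0.764, 0.658, 0.48, 0.31, 0.22]
for cutoff, count in survivors_by_cutoff(scores, [0.3, 0.5, 0.8, 0.9]).items():
    print(f"cutoff={cutoff:.1f} keeps {count} of {len(scores)} nodes")
```

For these sample scores the counts are 5, 3, 1, and 0, so a cutoff of 0.9 would reproduce the empty source_nodes symptom.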

ImportError: cannot import SimilarityPostprocessor

Update llama-index: pip install --upgrade llama-index-core. You're likely on 0.9.x or earlier. Version 0.12.x includes full postprocessor support.

Experienced dev note

Compact mode isn't about magic: it's about *explicit token accounting*. Many teams disable it because they don't measure token usage per query. Start by logging how many tokens each retrieved node consumes (use your LLM's tokenizer), then set a realistic max_tokens based on your cost model. Input tokens are billed per token on APIs like OpenAI, so every node you trim lowers cost roughly in proportion to its size, and the query processes faster because the prompt is smaller. Profile first, optimize second.
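A minimal sketch of that per-node logging follows. estimate_input_cost and word_count are hypothetical helpers, the whitespace tokenizer is a placeholder for your LLM's real tokenizer, and the price is a made-up rate, not any provider's actual pricing.

```python
from typing import Callable, List


def estimate_input_cost(
    node_texts: List[str],
    count_tokens: Callable[[str], int],
    usd_per_million_tokens: float,
) -> float:
    """Log per-node token counts and return the estimated input cost in USD."""
    total = 0
    for position, text in enumerate(node_texts, start=1):
        tokens = count_tokens(text)
        total += tokens
        print(f"node {position}: {tokens} tokens (running total {total})")
    return total * usd_per_million_tokens / 1_000_000


def word_count(text: str) -> int:
    """Placeholder tokenizer: whitespace words. Replace with your LLM's tokenizer."""
    return len(text.split())


texts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota kappa"]
# 2.00 USD per million input tokens is an invented rate for illustration only.
cost = estimate_input_cost(texts, word_count, usd_per_million_tokens=2.00)
print(f"estimated input cost: ${cost:.8f}")
```

Running this over real query logs gives the distribution of node sizes, which is what a sensible max_tokens should be based on.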

Check your understanding

You're building a support chatbot using GPT-4 with an 8K context window. Your system reserves 2K for the system prompt and response. You retrieve 20 candidate nodes via similarity search. With compact mode configured to stop at 5K tokens, why might you end up retrieving only 4 nodes instead of the full 20? What would you check if you were getting back all 20 nodes anyway?

Answer hint

A correct answer explains that compact mode stops adding nodes once the cumulative token count would exceed the max_tokens threshold, so the token counts of individual nodes determine how many fit. If you're getting all 20 nodes, no postprocessor is actually enforcing the compaction; you'd verify that the postprocessor is attached to the query engine and that its cutoff isn't too loose.

VERSION In llama-index-core < 0.10.0, compact mode was handled via index-level settings. Version 0.10.0+ moved this to the retriever and postprocessor level. Ensure you're on 0.12.x (April 2026 stable) to use the patterns shown here.

Next, explore node postprocessors beyond similarity filtering: metadata-based filtering, reranking, and diversity boosters that work alongside compact mode for even smarter context selection.
