Compact mode: fitting context efficiently
Why this matters
LLMs have fixed context windows. Compact mode keeps each query's context down to the most relevant information that fits, which reduces cost and latency while maintaining accuracy on retrieval-augmented answers.
Explanation
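What it is: In this lesson, compact mode means keeping the context you send to the LLM small: rank retrieved nodes by relevance, drop the weak ones, and pack what is left as tightly as the context window allows. In llama-index the filtering is done by node postprocessors, and the packing is exposed as response_mode="compact" on the query engine. Instead of always sending a fixed number of nodes, you send only the nodes that are relevant and fit.
How it works mechanically: You set similarity_top_k high so the retriever fetches a generous pool of candidates by similarity score. Node postprocessors then prune that pool; SimilarityPostprocessor, for example, drops nodes below a score cutoff. Finally, the response synthesizer in compact mode concatenates the surviving chunks into as few prompts as fit the model's context window. The vector retriever itself takes no compact=True or max_tokens argument, so if you need a hard token budget you enforce it yourself: add nodes in rank order, track cumulative tokens, and stop the moment the next node would breach the budget. Either way, you get variable-length context adapted to the query instead of a fixed node count.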
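The rank-order, stop-at-the-budget idea can be sketched in plain Python. This is a minimal illustration, not llama-index's implementation: pack_by_token_budget, approx_tokens, and the (text, score) tuples are hypothetical names made up for this example, and the whitespace word count is only a rough stand-in for a real tokenizer.

```python
from typing import Callable, List, Tuple

# (text, similarity_score) pairs; a hypothetical stand-in for retrieved nodes.
ScoredChunk = Tuple[str, float]


def approx_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_by_token_budget(
    chunks: List[ScoredChunk],
    max_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[ScoredChunk]:
    """Add chunks in descending score order; stop before the budget is exceeded."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    packed: List[ScoredChunk] = []
    used = 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # the next chunk would breach the budget
        packed.append((text, score))
        used += cost
    return packed


candidates = [
    ("wind " * 40, 0.89),   # 40 approximate tokens
    ("solar " * 50, 0.76),  # 50 approximate tokens
    ("hydro " * 30, 0.66),  # 30 approximate tokens
    ("coal " * 20, 0.41),   # 20 approximate tokens
]
selected = pack_by_token_budget(candidates, max_tokens=100)
print([score for _, score in selected])  # [0.89, 0.76]
```

The loop stops at the first chunk that does not fit rather than skipping ahead to smaller, lower-ranked chunks. That keeps the selection strictly in relevance order, at the price of sometimes leaving part of the budget unused.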
When to use it: Use compact mode in production systems where you pay per token (OpenAI, Anthropic), when responses must be fast, or when you have strict context window constraints. It's especially valuable for search-heavy applications where irrelevant context hurts both cost and model reasoning quality.
Analogy
Think of compact mode like packing a suitcase for a flight with a weight limit. You don't pack all your clothes: you rank by importance (underwear, toothbrush, socks) and keep adding items until you hit the weight limit. You end up with exactly what you need, no extra baggage.
Code
import os

from llama_index.core import (
    PromptTemplate,
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Placeholder only: in real projects, export OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-your-key-here"

Settings.llm = OpenAI(model="gpt-4.1", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load every file under ./data and build an in-memory vector index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Wrap the prompt in a PromptTemplate; the engine expects a template object, not a bare string.
qa_template = PromptTemplate(
    "Answer based only on context: {context_str}\n\nQuestion: {query_str}"
)

# Baseline engine: always passes the top 10 nodes to the LLM, unfiltered.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    text_qa_template=qa_template,
)

# Compact engine: wider pool of 15 candidates, pruned by a similarity cutoff.
# as_query_engine builds its own retriever from these keyword arguments,
# so postprocessors are passed here rather than to as_retriever.
compact_query_engine = index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
    response_mode="compact",  # pack surviving chunks into as few prompts as fit
    text_qa_template=qa_template,
)

response = compact_query_engine.query("What are the main benefits of renewable energy?")
print(f"Answer: {response}")
print(f"Retrieved {len(response.source_nodes)} nodes")
for node in response.source_nodes:
    # Whitespace word count is only a rough stand-in for real token counts.
    print(f" - Score: {node.score:.3f}, Tokens: ~{len(node.get_content().split())}")

Output:
Answer: Renewable energy sources provide sustainable power generation, reduce carbon emissions, and offer long-term cost savings through decreased fuel dependency. Major benefits include environmental protection, energy independence, and technological job creation.
Retrieved 3 nodes
 - Score: 0.892, Tokens: ~145
 - Score: 0.764, Tokens: ~132
 - Score: 0.658, Tokens: ~118
What just happened?
The code built a vector index from the documents, then configured a query engine whose retriever fetches the 15 most similar nodes. A SimilarityPostprocessor dropped every node scoring below the 0.5 similarity cutoff, keeping only the strongest matches. When the query ran, 3 of the candidates survived the cutoff, and the answer was synthesized from those 3 nodes alone. The output shows how many nodes were actually used, their relevance scores, and a rough size for each (a whitespace word count, which only approximates real token counts).
Common gotcha
Developers often set similarity_top_k=100 and expect compact mode to shrink the result on its own, but similarity_top_k only sets the size of the candidate pool. The actual compaction happens only when you add a SimilarityPostprocessor or a token-aware postprocessor. Without one, all 100 nodes are still retrieved and passed along. Compact mode requires explicit configuration of the filtering logic.
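The difference between the candidate pool and the filtered result can be shown with a small pure-Python stand-in. The helper names top_k and apply_cutoff, and the scores below, are invented for illustration; this mimics the idea of a similarity cutoff and is not the library's code.

```python
from typing import List, Tuple

# (text, similarity_score) pairs; a hypothetical stand-in for retrieved nodes.
ScoredChunk = Tuple[str, float]


def top_k(chunks: List[ScoredChunk], k: int) -> List[ScoredChunk]:
    """Candidate pool: the k highest-scoring chunks, with no quality filter."""
    return sorted(chunks, key=lambda c: c[1], reverse=True)[:k]


def apply_cutoff(chunks: List[ScoredChunk], cutoff: float) -> List[ScoredChunk]:
    """Keep only chunks whose score reaches the cutoff."""
    return [c for c in chunks if c[1] >= cutoff]


corpus = [
    ("solar basics", 0.91),
    ("wind farms", 0.72),
    ("grid storage", 0.55),
    ("coal history", 0.48),
    ("office lunch menu", 0.30),
    ("parking rules", 0.12),
]

pool = top_k(corpus, k=100)            # a large k just returns everything available
kept = apply_cutoff(pool, cutoff=0.5)  # the filter is what shrinks the context

print(len(pool))  # 6
print(len(kept))  # 3
```

Raising k alone never removes a weak match. Only the explicit cutoff step does.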
Error recovery
ValueError: max_tokens must be positive
No nodes returned (empty source_nodes)
ImportError: cannot import SimilarityPostprocessor
Experienced dev note
Compact mode isn't about magic: it's about *explicit token accounting*. Many teams disable it because they don't measure token usage per query. Start by logging how many tokens each retrieved node consumes (use your LLM's tokenizer), then set a realistic token budget based on your cost model. Input is billed per token, so trimming retrieved context cuts input cost roughly in proportion, and smaller prompts also tend to return faster. Profile first, optimize second.
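A per-query token log might start out like the sketch below. Everything here is an assumption made for illustration: context_report is a hypothetical helper, the whitespace counter should be replaced with your model's real tokenizer, and the price of 0.002 USD per 1K input tokens is a made-up figure, not a quoted rate.

```python
from typing import Callable, Dict, List


def whitespace_tokens(text: str) -> int:
    """Rough token estimate; swap in your LLM's tokenizer for real numbers."""
    return len(text.split())


def context_report(
    node_texts: List[str],
    usd_per_1k_input_tokens: float,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> Dict[str, object]:
    """Summarize per-node token usage and the estimated input cost of one query."""
    per_node = [count_tokens(text) for text in node_texts]
    total = sum(per_node)
    return {
        "per_node_tokens": per_node,
        "total_tokens": total,
        "estimated_input_cost_usd": total / 1000 * usd_per_1k_input_tokens,
    }


# Three fake nodes sized like the ones in the sample output above.
nodes = ["word " * 145, "word " * 132, "word " * 118]

# Hypothetical price, for illustration only.
report = context_report(nodes, usd_per_1k_input_tokens=0.002)
print(report["per_node_tokens"])  # [145, 132, 118]
print(report["total_tokens"])     # 395
```

Logging this report for a sample of real queries gives you the distribution you need before picking a budget.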
Check your understanding
You're building a support chatbot using GPT-4 with an 8K context window. Your system reserves 2K for the system prompt and response. You retrieve 20 candidate nodes via similarity search. With compact mode configured to stop at 5K tokens, why might you end up retrieving only 4 nodes instead of the full 20? What would you check if you were getting back all 20 nodes anyway?
Answer hint
A correct answer explains that compact mode stops adding nodes once the cumulative token count would exceed the max_tokens threshold, so the token counts of the individual nodes determine how many fit. If you're getting all 20 nodes back, nothing is actually enforcing the compaction; verify that a postprocessor is configured and that its cutoff or token budget isn't too loose.
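The arithmetic behind the scenario can be worked through directly. The per-node size of about 1.2K tokens is a hypothetical figure chosen for this example; the window, reserve, and budget come from the question.

```python
from bisect import bisect_right
from itertools import accumulate

context_window = 8_000
reserved = 2_000                       # system prompt + response
available = context_window - reserved  # 6,000 tokens left for context
budget = 5_000                         # compact-mode stop threshold
assert budget <= available

# Hypothetical: 20 ranked candidates of roughly 1.2K tokens each.
node_tokens = [1_200] * 20

# Running totals in rank order: 1200, 2400, 3600, 4800, 6000, ...
running = list(accumulate(node_tokens))

# Number of leading nodes whose cumulative total stays within the budget.
fits = bisect_right(running, budget)
print(fits)  # 4
```

The fifth node would push the total to 6,000 tokens, which is over the 5,000 budget, so only 4 of the 20 candidates are used.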