How to · Intermediate · 3 min read

How to compress context for an LLM

Quick answer
To compress context for an LLM, use techniques such as semantic summarization, embedding-based retrieval, or chunking with a vector database to reduce input size while preserving meaning. These methods fit more relevant information into the context window of models like gpt-4o-mini or claude-3-5-sonnet-20241022.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the specifier so the shell does not treat >= as a redirect)
  • pip install faiss-cpu or pip install chromadb (optional, for vector search)

Setup

Install the openai Python SDK for LLM calls. Optionally, install faiss-cpu or chromadb to enable vector similarity search for retrieval-augmented compression.

bash
pip install openai faiss-cpu chromadb
output
Collecting openai
Collecting faiss-cpu
Collecting chromadb
Successfully installed openai faiss-cpu chromadb

Step by step

This example shows how to compress long context by chunking text, embedding chunks, and retrieving the most relevant chunks to include in the prompt. It uses text-embedding-3-small for embeddings and gpt-4o-mini for summarization.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample long context
long_text = """\
Large documents or conversations often exceed the context window of LLMs. To handle this, we split the text into chunks, embed each chunk, and store embeddings in a vector store. When querying, we embed the query and retrieve the most relevant chunks to include in the prompt, effectively compressing context.
"""

# Step 1: Chunk the text (a simple sentence split; production pipelines use token-aware chunkers)
chunks = [c.strip() for c in long_text.split('. ') if c.strip()]

# Step 2: Embed all chunks in one batched API call (the input parameter accepts a list)
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
embeddings = [(chunk, item.embedding) for chunk, item in zip(chunks, response.data)]

# Step 3: Embed the query
query = "How to handle large documents with LLMs?"
query_embedding_resp = client.embeddings.create(model="text-embedding-3-small", input=query)
query_embedding = query_embedding_resp.data[0].embedding

# Step 4: Compute cosine similarity and select top chunks
import numpy as np

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = [(chunk, cosine_similarity(query_embedding, emb)) for chunk, emb in embeddings]
similarities.sort(key=lambda x: x[1], reverse=True)

# Select top 2 chunks
top_chunks = [chunk for chunk, _ in similarities[:2]]

# Step 5: Summarize selected chunks to compress context
prompt = f"Summarize the following text concisely:\n\n{' '.join(top_chunks)}"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
summary = response.choices[0].message.content
print("Compressed context summary:", summary)
output
Compressed context summary: To handle large documents with LLMs, split text into chunks, embed them, retrieve relevant chunks via similarity, and summarize to fit within context limits.
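
Once the number of chunks grows beyond a handful, the per-pair similarity loop above is worth replacing with a single vectorized pass. A minimal sketch using NumPy; the `top_k_chunks` helper and its toy 2-D embeddings are illustrative, not part of the OpenAI SDK:

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    q = np.asarray(query_emb, dtype=float)
    m = np.asarray(chunk_embs, dtype=float)  # shape (n_chunks, dim)
    # Cosine similarity of every chunk against the query in one matrix-vector product
    sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:k]       # indices of the k highest similarities
    return [chunks[i] for i in order]

# Toy 2-D embeddings: the first two vectors point roughly along the query
chunks = ["split into chunks", "embed each chunk", "unrelated text"]
embs = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0]]
print(top_k_chunks([1.0, 0.0], embs, chunks, k=2))
# ['split into chunks', 'embed each chunk']
```

With real embeddings you would pass the vectors returned by the embeddings API instead of the toy 2-D lists.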

Common variations

  • Use async calls with asyncio for embedding and chat requests to improve throughput.
  • Replace text-embedding-3-small with other embedding models like text-embedding-3-large for better accuracy.
  • Use vector databases like FAISS or Chroma for scalable retrieval instead of in-memory similarity.
  • Apply chunking strategies like overlapping windows or semantic segmentation for better context preservation.
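
The overlapping-window strategy from the last bullet can be sketched in a few lines. This is a word-based toy; the `window` and `overlap` sizes are illustrative and should be tuned to your tokenizer and model:

```python
def chunk_with_overlap(text, window=50, overlap=10):
    """Split text into word windows that each share `overlap` words with the next."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(words):
            break  # last window already covers the tail
    return chunks

text = " ".join(str(i) for i in range(120))
parts = chunk_with_overlap(text, window=50, overlap=10)
print(len(parts))  # 3 windows: words 0-49, 40-89, 80-119
```

The shared words at each boundary keep sentences that straddle a window edge retrievable from at least one chunk.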

Troubleshooting

  • If embeddings are slow, batch inputs to the embedding API to reduce latency.
  • If context is still too large, increase summarization aggressiveness or reduce chunk size.
  • Ensure your OPENAI_API_KEY is set correctly to avoid authentication errors.
  • Watch for token limits on the summarization model; chunk and summarize iteratively if needed.
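
The iterative summarization from the last bullet can be sketched as a map-reduce pass: summarize batches of chunks, then summarize the summaries until one remains. Here `summarize` is an injectable stand-in for a chat-completion call, so the control flow is clear and testable without an API key:

```python
def iterative_summarize(chunks, summarize, batch=3):
    """Summarize chunks in batches, then recursively summarize the summaries."""
    while len(chunks) > 1:
        chunks = [
            summarize(" ".join(chunks[i:i + batch]))
            for i in range(0, len(chunks), batch)
        ]
    return chunks[0]

# Stand-in summarizer: keeps the first five words of its input.
toy = lambda text: " ".join(text.split()[:5])
docs = [f"chunk {i} body text here" for i in range(7)]
print(iterative_summarize(docs, toy))
# chunk 0 body text here
```

In production, `summarize` would wrap the same client.chat.completions.create call used in the step-by-step example, with `batch` chosen so each joined batch stays under the model's token limit.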

Key Takeaways

  • Compress context by chunking and embedding text, then retrieving relevant chunks for the prompt.
  • Use summarization on retrieved chunks to fit more information within the LLM's context window.
  • Vector databases like FAISS or Chroma enable scalable and efficient context compression.
  • Batch API calls and iterative summarization help manage token limits and latency.
  • Always verify your API key and model token limits to avoid runtime errors.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small, claude-3-5-sonnet-20241022