How to summarize history to manage context
Quick answer
To manage context for large language models with a limited context window, summarize history by condensing past interactions into concise representations, using techniques such as abstractive summarization or embedding-based retrieval. This reduces token usage while preserving the information needed for coherent responses.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your API key as an environment variable to access the OpenAI API for summarization and embeddings.
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
This example summarizes conversation history with gpt-4o-mini (abstractive summarization), then embeds the summary with text-embedding-3-small so it can be retrieved later, keeping context usage low.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Sample conversation history
history = [
    "User: What is AI?",
    "Assistant: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.",
    "User: How does machine learning fit in?",
    "Assistant: Machine learning is a subset of AI focused on training models to learn from data."
]
# Step 1: Concatenate history for summarization
history_text = "\n".join(history)
# Step 2: Summarize history to reduce token usage
summary_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"Summarize the following conversation history concisely:\n{history_text}"}
    ]
)
summary = summary_response.choices[0].message.content
print("Summary:", summary)
# Step 3: Create embedding of summary for retrieval
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=summary
)
summary_embedding = embedding_response.data[0].embedding
print(f"Embedding vector length: {len(summary_embedding)}")
output
Summary: The conversation covers AI as the simulation of human intelligence and explains machine learning as a subset focused on data-driven model training.
Embedding vector length: 1536
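Once you have embeddings for several stored summaries, you can retrieve the most relevant one for a new query by cosine similarity. A minimal pure-Python sketch; the 3-dimensional vectors below are toy stand-ins for real 1536-dimensional embedding output, and the function names are illustrative:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_relevant(query_embedding, summary_embeddings):
    # Return the index of the stored summary embedding closest to the query
    scores = [cosine_similarity(query_embedding, e) for e in summary_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vectors standing in for real embeddings
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [0.9, 0.1, 0.0]
print(most_relevant(query, stored))  # → 0
```

In practice you would embed the user's new query with the same text-embedding-3-small model, pick the best-matching stored summary, and prepend it to the prompt.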
Common variations
You can use asynchronous calls with async/await for better throughput in web apps; note that this requires the AsyncOpenAI client rather than the synchronous OpenAI client. Different models, such as gpt-4o or claude-3-5-sonnet-20241022 (via Anthropic's API), can be used for summarization depending on quality and cost trade-offs. For large histories, chunk the text and summarize the chunks separately before combining the summaries.
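The chunking variation can be sketched as a helper that groups consecutive messages so each chunk stays under a character budget (a rough proxy for tokens; the function name and the 2000-character default are illustrative, not part of any library):

```python
def chunk_history(history, max_chars=2000):
    """Group consecutive messages into chunks of at most max_chars characters."""
    chunks, current, current_len = [], [], 0
    for message in history:
        # Start a new chunk when adding this message would exceed the budget
        if current and current_len + len(message) > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(message)
        current_len += len(message) + 1  # +1 for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks

# Each chunk would then be summarized separately (as in the step-by-step
# example), and the partial summaries combined in a final summarization call.
```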
import asyncio
from openai import AsyncOpenAI

# Asynchronous calls require the AsyncOpenAI client
async_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_summarize(history_text):
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this conversation:\n{history_text}"}]
    )
    return response.choices[0].message.content

# Usage example
# summary = asyncio.run(async_summarize(history_text))
Troubleshooting
- If the summary is too long, reduce the input size by chunking history, ask for brevity explicitly in the prompt (e.g. "in two sentences"), or cap the output with max_tokens; temperature does not control length.
- If embeddings are inconsistent, ensure you use the same embedding model for all vectors.
- Watch for token limits; summarize before sending to avoid exceeding the model's context window.
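A simple guard along those lines can be sketched as follows; the "4 characters per token" rule is only a rough heuristic for GPT-style tokenizers (use a real tokenizer such as tiktoken for exact counts), and the 8000-token budget is illustrative:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

def fits_context(messages, max_tokens=8000):
    """Check whether the message list is safely under the token budget."""
    total = sum(estimate_tokens(m) for m in messages)
    return total <= max_tokens

history = ["User: What is AI?", "Assistant: AI stands for Artificial Intelligence."]
if not fits_context(history, max_tokens=8000):
    # Budget exceeded: summarize the history before calling the model
    pass
```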
Key Takeaways
- Summarize conversation history to fit within the model's context window efficiently.
- Use embeddings of summaries for retrieval-augmented generation to maintain relevant context.
- Chunk large histories and summarize incrementally to avoid token overflow.
- Choose models balancing cost, speed, and summary quality for your use case.