How to · Intermediate · 3 min read

How to summarize history to manage context

Quick answer
To manage context in large language models with limited context windows, summarize history by condensing past interactions into concise representations, using techniques such as abstractive summarization or embedding-based retrieval. This reduces token usage while preserving the information needed for coherent responses.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install openai>=1.0

Setup

Install the openai Python package and set your API key as an environment variable to access the OpenAI API for summarization and embeddings.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example shows how to summarize conversation history using gpt-4o-mini for abstractive summarization and text-embedding-3-small for embedding-based retrieval to manage context efficiently.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample conversation history
history = [
    "User: What is AI?",
    "Assistant: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines.",
    "User: How does machine learning fit in?",
    "Assistant: Machine learning is a subset of AI focused on training models to learn from data."
]

# Step 1: Concatenate history for summarization
history_text = "\n".join(history)

# Step 2: Summarize history to reduce token usage
summary_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"Summarize the following conversation history concisely:\n{history_text}"}
    ]
)
summary = summary_response.choices[0].message.content
print("Summary:", summary)

# Step 3: Create embedding of summary for retrieval
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=summary
)
summary_embedding = embedding_response.data[0].embedding
print(f"Embedding vector length: {len(summary_embedding)}")
output
Summary: The conversation covers AI as the simulation of human intelligence and explains machine learning as a subset focused on data-driven model training.
Embedding vector length: 1536
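Once you have embeddings for several stored summaries, embedding-based retrieval means picking the summary most similar to the current query. Here is a minimal sketch using cosine similarity; the three-element vectors are tiny hand-made stand-ins for real 1536-dimensional embeddings, and the topic keys are hypothetical:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for embeddings of previously stored summaries
stored = {
    "ai-basics": [0.9, 0.1, 0.0],
    "cooking-tips": [0.0, 0.2, 0.9],
}
# In practice this would come from client.embeddings.create on the new query
query_embedding = [0.8, 0.2, 0.1]

# Retrieve the stored summary closest to the query
best = max(stored, key=lambda k: cosine_similarity(query_embedding, stored[k]))
print(best)  # ai-basics
```

Only the best-matching summary then needs to be re-injected into the prompt, rather than the whole history.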

Common variations

You can use asynchronous calls with async and await (via the AsyncOpenAI client) for better throughput in web apps. Different models such as gpt-4o or claude-3-5-sonnet-20241022 can be used for summarization depending on quality and cost trade-offs. For large histories, chunk the text and summarize chunks separately before combining summaries.

python
import asyncio
import os
from openai import AsyncOpenAI

# Awaiting requires the async client; the synchronous OpenAI client
# used earlier would raise a TypeError here.
async_client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def async_summarize(history_text):
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this conversation:\n{history_text}"}]
    )
    return response.choices[0].message.content

# Usage example
# summary = asyncio.run(async_summarize(history_text))
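The chunk-then-combine approach mentioned above needs a way to split a long history into pieces that each fit in one request. A simple sketch, using a character budget as a rough stand-in for a real token count:

```python
def chunk_history(messages, max_chars=500):
    # Group consecutive messages into chunks whose combined length
    # stays under max_chars; each chunk is summarized separately,
    # then the chunk summaries are combined in a final pass.
    chunks, current, size = [], [], 0
    for msg in messages:
        if current and size + len(msg) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        chunks.append(current)
    return chunks

long_history = [
    "User: What is AI?" * 10,
    "Assistant: ..." * 20,
    "User: More?",
]
for i, chunk in enumerate(chunk_history(long_history, max_chars=300)):
    print(f"Chunk {i}: {len(chunk)} message(s)")
```

Each chunk would then be passed through the summarization call shown earlier, and the resulting summaries concatenated and summarized once more.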

Troubleshooting

  • If the summary is too long, reduce the input size by chunking history, or ask for brevity explicitly (e.g., "in two sentences") or cap the response with max_tokens; temperature controls randomness, not length.
  • If embeddings are inconsistent, ensure you use the same embedding model for all vectors.
  • Watch for token limits; summarize before sending to avoid exceeding the model's context window.
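To catch token-limit problems before sending a request, you can estimate how large the prompt will be. The 4-characters-per-token figure below is a common rule of thumb for English text, not an exact count (use a tokenizer library such as tiktoken for precise numbers); the 128,000 figure is gpt-4o-mini's context window:

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 128_000   # gpt-4o-mini context window, in tokens
RESPONSE_BUDGET = 1_000    # tokens reserved for the model's reply

def needs_summarizing(history_text):
    # Summarize whenever the estimated prompt size would crowd out the reply
    return estimate_tokens(history_text) > CONTEXT_WINDOW - RESPONSE_BUDGET

print(needs_summarizing("User: What is AI?"))  # False
```

Running this check on every turn lets you trigger summarization only when the history actually approaches the limit.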

Key Takeaways

  • Summarize conversation history to fit within the model's context window efficiently.
  • Use embeddings of summaries for retrieval-augmented generation to maintain relevant context.
  • Chunk large histories and summarize incrementally to avoid token overflow.
  • Choose models balancing cost, speed, and summary quality for your use case.
Verified 2026-04 · gpt-4o-mini, text-embedding-3-small, gpt-4o, claude-3-5-sonnet-20241022