Summarization memory: compressing long histories
Why this matters
Long conversations inflate your token budget and API costs. Summarization memory condenses earlier exchanges into a compact summary, optionally keeping the most recent messages verbatim, which is essential for chatbots that need to stay cheap and responsive over many turns.
Explanation
Summarization memory is a LangChain memory strategy that replaces older conversation turns with an LLM-written summary. Instead of storing the entire history, which grows linearly with the number of turns, it maintains a condensed account of what happened earlier, so the history portion of each prompt stays bounded.
LangChain offers two variants. ConversationSummaryMemory keeps only the running summary: every save_context() call sends the previous summary plus the newest exchange to the LLM and stores the result in its buffer, with no verbatim window. ConversationSummaryBufferMemory keeps recent messages in full and summarizes the oldest ones only once the buffer exceeds its max_token_limit. Both are legacy classes; in LangChain 1.x (including the 1.2.x line this section targets) they are expected to be importable from the langchain-classic package rather than from langchain.memory, so check the import path against your installed version.
Use summarization memory when you're building long-running conversational agents where sessions may span hours, cost is a real constraint, and you can tolerate some loss of detail in old exchanges in return for token savings.
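The simplest form of this idea folds each new exchange into a running summary every time a turn is saved. The sketch below models that mechanism in plain Python so it runs without an API key. RollingSummaryMemory and fake_summarize are hypothetical names invented for illustration, and the summarizer is a string-clipping stand-in rather than a real LLM call.

```python
from typing import Callable


def fake_summarize(previous_summary: str, new_lines: str) -> str:
    # Hypothetical stand-in for the LLM call: append a clipped copy of the
    # new lines to the previous summary. A real summarizer would paraphrase.
    clipped = new_lines.replace("\n", " | ")[:60]
    return (previous_summary + " " + clipped).strip()


class RollingSummaryMemory:
    """Toy model of summary-only memory: one summarizer call per saved turn."""

    def __init__(self, summarize: Callable[[str, str], str], initial_summary: str = ""):
        self.summarize = summarize
        self.buffer = initial_summary  # the running summary, always a string
        self.summarizer_calls = 0

    def save_context(self, human: str, ai: str) -> None:
        new_lines = f"Human: {human}\nAI: {ai}"
        # Only the newest exchange and the old summary go to the summarizer.
        self.buffer = self.summarize(self.buffer, new_lines)
        self.summarizer_calls += 1


memory = RollingSummaryMemory(fake_summarize)
for i in range(1, 4):
    memory.save_context(f"Tell me about topic {i}", f"Topic {i} is useful.")

print(memory.summarizer_calls)  # 3: one call per saved turn
print(memory.buffer)
```

The summarizer never sees the full transcript, only the previous summary and the latest exchange, which is why each call stays small even when the conversation is long.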
Analogy
Think of a teacher tracking a student's essays over a semester. Instead of rereading every essay from day one, they keep a running note: 'First month: student struggled with thesis statements but improved from week 2 onward.' As new essays come in, older ones are folded into that note. If the teacher also keeps the last 5 essays in full, recent work stays detailed while everything earlier lives only in the summary.
Code
import os

from langchain_openai import ChatOpenAI

# Legacy memory classes: LangChain 1.x is assumed to ship them in langchain-classic;
# fall back to the pre-1.0 import path if that package is not installed.
try:
    from langchain_classic.memory import ConversationSummaryMemory
except ImportError:
    from langchain.memory import ConversationSummaryMemory

api_key = os.environ.get('OPENAI_API_KEY')
if not api_key:
    raise ValueError('Set OPENAI_API_KEY environment variable')

llm = ChatOpenAI(model='gpt-4o-mini', api_key=api_key)

# buffer seeds the running summary before any turns are saved.
memory = ConversationSummaryMemory(
    llm=llm,
    buffer="The conversation is between a user and an AI assistant."
)

for i in range(5):
    human_input = f"Question {i+1}: Tell me about topic number {i+1}"
    print(f"Human: {human_input}")
    # save_context() stores the exchange and updates the summary via the LLM.
    memory.save_context(
        {"input": human_input},
        {"output": f"Here is information about topic {i+1}. It is important and useful."}
    )
    print(f"Memory buffer after turn {i+1}:")
    print(memory.buffer)
    print("---")
    print()

print("\nFinal memory state:")
print(memory.buffer)
Output
Human: Question 1: Tell me about topic number 1
Memory buffer after turn 1:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
---

H: Question 2: Tell me about topic number 2
Memory buffer after turn 2:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
---

H: Question 3: Tell me about topic number 3
Memory buffer after turn 3:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
---

H: Question 4: Tell me about topic number 4
Memory buffer after turn 4:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
---

H: Question 5: Tell me about topic number 5
Memory buffer after turn 5:
Progressively summarizing new lines of the conversation.
Processed the following message:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Summary so far: The user and AI engaged in an informative discussion about two topics. In the first exchange, the user asked about topic 1, and the AI provided useful information on the subject. Subsequently, the user inquired about topic 2, and the AI furnished relevant information about that topic as well.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
Human: Question 5: Tell me about topic number 5
AI: Here is information about topic 5. It is important and useful.
---


Final memory state:
Progressively summarizing new lines of the conversation.
Processed the following message:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Summary so far: The user and AI engaged in an informative discussion about two topics. In the first exchange, the user asked about topic 1, and the AI provided useful information on the subject. Subsequently, the user inquired about topic 2, and the AI furnished relevant information about that topic as well.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
Human: Question 5: Tell me about topic number 5
AI: Here is information about topic 5. It is important and useful.
What just happened?
The code created a ConversationSummaryMemory instance backed by gpt-4o-mini, seeded its buffer with a one-line description, and looped 5 times, each iteration saving a human question and an AI response. In the transcript shown, the buffer holds the exchanges verbatim through turn 4; at turn 5 the first two exchanges are condensed into a summary paragraph while exchanges 3, 4, and 5 remain in full. That summary-plus-recent-turns shape is what the buffer variant, ConversationSummaryBufferMemory, is designed to produce. ConversationSummaryMemory itself calls the LLM inside every save_context(), so when you run this code, expect memory.buffer to be a rewritten summary after each turn, with wording that varies between runs. In both cases the history sent with later prompts stays bounded instead of growing with every exchange.
Common gotcha
Developers often expect summarization to run in the background as soon as the buffer gets large. In reality it runs synchronously inside save_context(): ConversationSummaryMemory makes one LLM call per saved turn, and ConversationSummaryBufferMemory makes one only when a new turn pushes the buffer past max_token_limit. Nothing is compressed between turns, so if the conversation pauses, memory stays exactly as the last save_context() left it. The summarization call also costs tokens and latency of its own; if your conversation is only 2-3 turns, summarizing can cost more than simply keeping the full history.
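The claim that summarization can cost more than keeping the full history in short conversations is easy to check with rough arithmetic. The sketch below compares history tokens under two simplified models. The figures (100 tokens per exchange, a 150-token summary) and both function names are assumptions chosen for illustration, not measurements.

```python
def full_history_tokens(turns: int, tokens_per_exchange: int) -> int:
    """History tokens sent across all turns when every past exchange is replayed."""
    # Turn k replays the k-1 exchanges that came before it.
    return sum((k - 1) * tokens_per_exchange for k in range(1, turns + 1))


def summary_memory_tokens(turns: int, tokens_per_exchange: int, summary_tokens: int) -> int:
    """History tokens plus summarizer overhead when every turn is summarized."""
    chat_prompt = summary_tokens                             # summary replayed in the chat prompt
    summarizer_input = summary_tokens + tokens_per_exchange  # old summary + newest exchange
    summarizer_output = summary_tokens                       # the rewritten summary
    return turns * (chat_prompt + summarizer_input + summarizer_output)


for turns in (3, 12, 20):
    full = full_history_tokens(turns, tokens_per_exchange=100)
    summarized = summary_memory_tokens(turns, tokens_per_exchange=100, summary_tokens=150)
    print(turns, full, summarized)
```

Under these assumed numbers, a 3-turn conversation spends 300 history tokens with full replay versus 1650 with per-turn summarization; the two approaches break even at 12 turns, and summarization only pulls ahead after that.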
Error recovery
AuthenticationError
AttributeError on .buffer
Token limit exceeded on summarization
Experienced dev note
Summarization memory sounds magical but has a hidden cost: every summarization is an extra LLM call. With ConversationSummaryMemory that is one call per turn, so a busy production chatbot pays summarization overhead on top of its regular conversation costs. For many use cases, ConversationSummaryBufferMemory (which keeps recent messages within a token limit and summarizes only what overflows) or plain ConversationBufferWindowMemory (the last k exchanges only) is cheaper and simpler. Summarization shines when you genuinely need semantic continuity over very long sessions (8+ hours), not for typical short-to-medium conversations.
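To make the "only summarizes if needed" behavior concrete, here is a minimal plain-Python sketch of a token-limited buffer. It is not LangChain's implementation: SummaryBufferSketch, count_tokens, and fake_summarize are hypothetical names, tokens are approximated as whitespace-separated words, and the summarizer is a stub that only records how many messages were folded in.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Crude assumption for the sketch: one token per whitespace-separated word.
    return len(text.split())


def fake_summarize(previous_summary: str, evicted: List[str]) -> str:
    # Hypothetical stand-in for the LLM call; it only notes how much was folded in.
    note = f"[{len(evicted)} older messages summarized]"
    return (previous_summary + " " + note).strip()


class SummaryBufferSketch:
    """Keep recent messages verbatim; summarize only what overflows the budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.messages: List[str] = []
        self.summary = ""
        self.summarizer_calls = 0

    def _buffer_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def save_context(self, human: str, ai: str) -> None:
        self.messages += [f"Human: {human}", f"AI: {ai}"]
        evicted: List[str] = []
        # Evict the oldest messages until the verbatim buffer fits the budget.
        while self.messages and self._buffer_tokens() > self.max_tokens:
            evicted.append(self.messages.pop(0))
        if evicted:  # no overflow means no summarizer call
            self.summary = fake_summarize(self.summary, evicted)
            self.summarizer_calls += 1


memory = SummaryBufferSketch(max_tokens=12)
for i in range(1, 6):
    memory.save_context(f"Question {i}", f"Answer {i}")

print(memory.summarizer_calls)  # 3: turns 1 and 2 fit the budget, turns 3-5 overflow
print(memory.summary)
print(memory.messages)
```

With this budget, five turns trigger three summarizer calls instead of five, and a larger budget would trigger fewer still. That trade-off between verbatim context and summarization overhead is the knob the token limit exposes.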
Check your understanding
If you're building a chatbot for customer support where conversations often last 20+ turns but you want to cap token usage, why would summarization memory alone not be sufficient, and what hybrid approach would a senior developer recommend?
Answer hint
A correct answer recognizes that summarization has latency (it costs an LLM call each time it triggers) and that summarized content can lose nuance. A hybrid approach would combine summarization for old turns with a retrieval step (RAG) to fetch relevant past details on demand, ensuring you keep both low token usage and high accuracy.
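One way to picture the hybrid is a prompt builder that combines the running summary, a few recent turns, and any archived turn that looks relevant to the new question. The sketch below uses naive word overlap as the retrieval step purely for illustration; a production system would more likely use embeddings and a vector store. All names and sample data here are hypothetical.

```python
import re
from typing import List, Set


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(archive: List[str], query: str, k: int = 1) -> List[str]:
    """Return up to k archived turns sharing the most words with the query."""
    query_words = words(query)
    scored = [(len(words(turn) & query_words), turn) for turn in archive]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [turn for _, turn in ranked[:k]]


def build_prompt(summary: str, archive: List[str], recent: List[str], question: str) -> str:
    parts = ["Summary: " + summary]
    relevant = retrieve(archive, question)
    if relevant:
        # Exact details lost by the summary are restored from the archive.
        parts.append("Relevant earlier turns:\n" + "\n".join(relevant))
    parts.append("Recent turns:\n" + "\n".join(recent))
    parts.append("Human: " + question)
    return "\n\n".join(parts)


archive = [
    "Human: My order number is 48213. AI: Thanks, noted.",
    "Human: I live in Lisbon. AI: Got it.",
]
recent = ["Human: Can you check the delivery status?", "AI: Sure, one moment."]
summary = "The customer gave order details and a shipping city."

prompt = build_prompt(summary, archive, recent, "What was my order number?")
print(prompt)
```

The summary alone no longer contains the order number, but the retrieval step pulls the one archived turn that does, while the unrelated turn stays out of the prompt.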