Code Intermediate medium · 6 min

Summarization memory: compressing long histories

What you will learn

Use summarization memory to compress long conversation histories into a condensed summary, keeping tokens low while maintaining context.

Why this matters

Long conversations blow up your token budget and API costs. Summarization memory automatically condenses old exchanges into a compact summary, keeping recent messages intact: essential for chatbots that need to stay cheap and responsive over multi-turn conversations.

Skip if: Don't use summarization memory if your conversations are naturally short (under 10 exchanges), if you need byte-for-byte recall of every detail for compliance, or if you're building a system where losing nuance in old messages breaks functionality. For RAG systems with retrieval, consider retrieving old context instead of summarizing it.

Explanation

Summarization memory is a LangChain memory strategy that compresses older conversation turns into a summary instead of storing the entire history, whose token count grows linearly with every turn. LangChain ships two variants. ConversationSummaryMemory keeps only a running summary: each save_context() call triggers an LLM call that folds the newest exchange into the existing summary, and that summary is what .buffer holds. ConversationSummaryBufferMemory keeps recent messages in full and, once they exceed a token threshold (max_token_limit), summarizes the oldest ones into the running summary. This lesson's code uses ConversationSummaryMemory from the legacy memory module. Use summarization memory when you're building long-running conversational agents where users may have multi-hour sessions, cost is a real constraint, and you can tolerate some loss of detail in old exchanges for the sake of token savings.
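The threshold-triggered variant (a running summary plus recent messages kept verbatim) can be sketched in plain Python with no LangChain dependency. This is a minimal illustration of the idea, not LangChain's implementation: the class name `SketchSummaryMemory`, the whitespace-based `count_tokens`, and the stub `fake_summarize` (standing in for the LLM call) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def fake_summarize(previous_summary: str, lines: List[str]) -> str:
    """Stand-in for the LLM call; only records how many lines were folded in."""
    note = f"[{len(lines)} earlier lines condensed]"
    return f"{previous_summary} {note}".strip()


@dataclass
class SketchSummaryMemory:
    max_tokens: int
    summarize: Callable[[str, List[str]], str] = fake_summarize
    summary: str = ""
    recent: List[str] = field(default_factory=list)
    summarize_calls: int = 0

    def save_context(self, human: str, ai: str) -> None:
        self.recent.append(f"Human: {human}")
        self.recent.append(f"AI: {ai}")
        overflow: List[str] = []
        # Move the oldest lines out until the verbatim window fits the budget.
        while self.recent and sum(count_tokens(m) for m in self.recent) > self.max_tokens:
            overflow.append(self.recent.pop(0))
        if overflow:
            # Summarization only happens here, inside save_context().
            self.summary = self.summarize(self.summary, overflow)
            self.summarize_calls += 1

    def load(self) -> str:
        parts = ([self.summary] if self.summary else []) + self.recent
        return "\n".join(parts)


memory = SketchSummaryMemory(max_tokens=26)
for i in range(1, 6):
    memory.save_context(f"Tell me about topic {i}", f"Topic {i} is important and useful.")

print(memory.load())
print("summarization calls:", memory.summarize_calls)
```

With this budget, two exchanges fit in the verbatim window, so every save from the third turn onward pushes one older exchange into the summary. In a real system the stub would be replaced by an LLM call, which is where the extra cost comes from.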

Analogy

Think of a teacher grading essays over a semester. Instead of reviewing every single essay from day one, they write a summary: 'First month: student struggled with thesis statements but improved week 2 onward.' Then they keep the last 5 essays in full. When summarizing happens mid-semester, old essays get rolled into that summary paragraph. Recent work stays detailed.

Code

Illustrative only - not runnable without a valid API key

python

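import os

from langchain_openai import ChatOpenAI

# Legacy memory classes import from langchain.memory on 0.x releases.
# Assumption: on langchain 1.x they ship in the langchain-classic package instead.
try:
    from langchain.memory import ConversationSummaryMemory
except ImportError:
    from langchain_classic.memory import ConversationSummaryMemory

# Fail fast: the memory needs a working LLM to write its summaries.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Set the OPENAI_API_KEY environment variable")

llm = ChatOpenAI(model="gpt-4o-mini", api_key=api_key)

# buffer seeds the running summary that later calls update.
memory = ConversationSummaryMemory(
    llm=llm,
    buffer="The conversation is between a user and an AI assistant.",
)

for i in range(5):
    human_input = f"Question {i+1}: Tell me about topic number {i+1}"
    print(f"Human: {human_input}")
    # Each save_context() call invokes the LLM to fold the new exchange into the summary.
    memory.save_context(
        {"input": human_input},
        {"output": f"Here is information about topic {i+1}. It is important and useful."},
    )
    print(f"Memory buffer after turn {i+1}:")
    print(memory.buffer)
    print("---")
    print()

print("\nFinal memory state:")
print(memory.buffer)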

Output

Human: Question 1: Tell me about topic number 1
Memory buffer after turn 1:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
---

Human: Question 2: Tell me about topic number 2
Memory buffer after turn 2:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
---

Human: Question 3: Tell me about topic number 3
Memory buffer after turn 3:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
---

Human: Question 4: Tell me about topic number 4
Memory buffer after turn 4:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.
Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
---

Human: Question 5: Tell me about topic number 5
Memory buffer after turn 5:
Progressively summarizing new lines of the conversation. Processed the following message:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.

Summary so far:
The user and AI engaged in an informative discussion about two topics. In the first exchange, the user asked about topic 1, and the AI provided useful information on the subject. Subsequently, the user inquired about topic 2, and the AI furnished relevant information about that topic as well.

Human: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
Human: Question 5: Tell me about topic number 5
AI: Here is information about topic 5. It is important and useful.
---

Final memory state:
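Progressively summarizing new lines of the conversation. Processed the following message:
Human: Question 1: Tell me about topic number 1
AI: Here is information about topic 1. It is important and useful.
Human: Question 2: Tell me about topic number 2
AI: Here is information about topic 2. It is important and useful.

Summary so far:
The user and AI engaged in an informative discussion about two topics. In the first exchange, the user asked about topic 1, and the AI provided useful information on the subject. Subsequently, the user inquired about topic 2, and the AI furnished relevant information about that topic as well.

H: Question 3: Tell me about topic number 3
AI: Here is information about topic 3. It is important and useful.
Human: Question 4: Tell me about topic number 4
AI: Here is information about topic 4. It is important and useful.
Human: Question 5: Tell me about topic number 5
AI: Here is information about topic 5. It is important and useful.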

What just happened?

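The code created a <code>ConversationSummaryMemory</code> instance backed by GPT-4o-mini, then looped 5 times, each iteration saving a human question and an AI response. In the transcript above, the buffer holds every exchange verbatim through turn 4. At turn 5, the first two exchanges are condensed into a summary paragraph while exchanges 3, 4, and 5 stay in full, so the buffer uses fewer tokens than all 5 raw exchanges would. Treat the transcript as illustrative of the summary-plus-recent layout rather than as literal output: the summary wording depends on the model, and plain <code>ConversationSummaryMemory</code> rewrites its summary on every <code>save_context()</code> call, so a real run should show a summary from turn 1 onward. The layout shown, with recent turns kept verbatim, is what <code>ConversationSummaryBufferMemory</code> is designed to produce.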

Common gotcha

Developers often expect summarization to run in the background as the buffer grows. In reality, ConversationSummaryMemory summarizes reactively: only inside save_context(), when a new exchange is saved. Nothing is compressed while the conversation is idle, and every summarization is itself an LLM call that costs tokens and adds latency. If your conversation is only 2-3 turns, the cost of summarizing can exceed the cost of just keeping the full history.
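The break-even point behind that last claim can be estimated with a few lines of arithmetic. The sketch below is a back-of-the-envelope cost model, not a measurement: the per-exchange, summary, and summarizer-prompt token counts are invented for illustration, and it assumes one summarization call per turn.

```python
from typing import Optional


def full_history_tokens(turns: int, exchange: int) -> int:
    """Prompt tokens when turn k resends all k exchanges seen so far."""
    return sum(k * exchange for k in range(1, turns + 1))


def summarized_tokens(turns: int, exchange: int, summary: int, overhead: int) -> int:
    """Tokens when every turn sends summary + new exchange, then re-summarizes."""
    chat_prompts = turns * (summary + exchange)
    # Summarizer input: instructions + old summary + new exchange; output: new summary.
    summarizer_calls = turns * (overhead + summary + exchange + summary)
    return chat_prompts + summarizer_calls


def break_even_turn(exchange: int, summary: int, overhead: int, limit: int = 1000) -> Optional[int]:
    """First turn count at which summarizing is cheaper, or None if not reached within limit."""
    for turns in range(1, limit + 1):
        if summarized_tokens(turns, exchange, summary, overhead) < full_history_tokens(turns, exchange):
            return turns
    return None


# Hypothetical sizes: 50-token exchanges, 80-token summaries, 120-token summarizer prompt.
EXCHANGE, SUMMARY, OVERHEAD = 50, 80, 120

for turns in (3, 40):
    print(
        turns,
        full_history_tokens(turns, EXCHANGE),
        summarized_tokens(turns, EXCHANGE, SUMMARY, OVERHEAD),
    )
print("break-even turn:", break_even_turn(EXCHANGE, SUMMARY, OVERHEAD))
```

Under these made-up numbers, a 3-turn conversation costs 1380 tokens with summarization against 300 without, and summarization only becomes cheaper from turn 18. Real break-even points depend on your message sizes and summarization prompt.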

Error recovery

AuthenticationError

Ensure OPENAI_API_KEY is set and valid. Summarization memory requires an LLM to do the summarization: if the key is missing or expired, the memory will fail at the first save_context() call.

AttributeError on .buffer

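On ConversationSummaryMemory, .buffer is a string holding the running summary, and the class manages it for you. Don't assign to it directly: the next save_context() call regenerates it from the current value plus the newest exchange. Use save_context() to write and load_memory_variables() to read. If .buffer raises AttributeError, confirm the object is the memory class you expect and that your installed version matches the one this lesson targets.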

Token limit exceeded on summarization

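If the summarization call itself exceeds the model's token limits (rare, but possible with very large exchanges or summaries), switch to ConversationSummaryBufferMemory and lower its max_token_limit parameter, or use ConversationBufferWindowMemory for a fixed window with no summarization at all. ConversationSummaryMemory itself has no max_token_limit setting.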

Experienced dev note

Summarization memory sounds magical but has a hidden cost: every summarization triggers an LLM call. In a busy production chatbot, if you're summarizing every 100 messages, you're paying for that LLM summarization overhead on top of your regular conversation costs. For many use cases, ConversationSummaryBufferMemory (which uses a fixed token window and only summarizes if needed) or plain ConversationBufferWindowMemory (last N messages only) is cheaper and simpler. Summarization shines when you genuinely need semantic continuity over very long sessions (8+ hours), not for typical short-to-medium conversations.
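The "last N messages only" alternative is simple enough to sketch with the standard library. This is an illustration of the idea, not the code behind ConversationBufferWindowMemory; `WindowMemory` and its method names are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class WindowMemory:
    """Keeps only the last k exchanges; older ones are dropped, never summarized."""

    def __init__(self, k: int) -> None:
        # deque(maxlen=k) discards the oldest item automatically on overflow.
        self.exchanges: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save_context(self, human: str, ai: str) -> None:
        self.exchanges.append((human, ai))

    def load(self) -> str:
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.exchanges)


window = WindowMemory(k=2)
for i in range(1, 6):
    window.save_context(f"Tell me about topic {i}", f"Topic {i} is important and useful.")

print(window.load())
```

No LLM call is involved, so the cost per turn is fixed, but anything older than the window is gone for good. That trade-off is what makes the window cheaper and simpler than summarization.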

Check your understanding

If you're building a chatbot for customer support where conversations often last 20+ turns but you want to cap token usage, why would summarization memory alone not be sufficient, and what hybrid approach would a senior developer recommend?

Answer hint

A correct answer recognizes that summarization has latency (it costs an LLM call each time it triggers) and that summarized content can lose nuance. A hybrid approach would combine summarization for old turns with a retrieval step (RAG) to fetch relevant past details on demand, ensuring you keep both low token usage and high accuracy.
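The hybrid described in the hint can be sketched in plain Python. Everything here is a simplifying assumption: `HybridMemory` is a hypothetical class, the summary is a placeholder string where an LLM call would go, and retrieval uses naive word overlap in place of the embedding search a real RAG step would use.

```python
import re
from typing import List, Set, Tuple

Exchange = Tuple[str, str]  # (human message, AI reply)


def words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for embedding similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class HybridMemory:
    def __init__(self, keep_recent: int, top_k: int = 1) -> None:
        self.keep_recent = keep_recent
        self.top_k = top_k
        self.summary = ""
        self.recent: List[Exchange] = []
        self.archive: List[Exchange] = []

    def save_context(self, human: str, ai: str) -> None:
        self.recent.append((human, ai))
        while len(self.recent) > self.keep_recent:
            # Old exchanges are archived verbatim for retrieval AND summarized.
            self.archive.append(self.recent.pop(0))
            # Placeholder for the LLM summarization call.
            self.summary = f"Summary of {len(self.archive)} earlier exchange(s)."

    def retrieve(self, query: str) -> List[Exchange]:
        query_words = words(query)
        scored = []
        for index, (human, ai) in enumerate(self.archive):
            overlap = len(query_words & words(f"{human} {ai}"))
            if overlap:
                scored.append((-overlap, index))
        scored.sort()  # highest overlap first, oldest first on ties
        return [self.archive[index] for _, index in scored[: self.top_k]]

    def build_context(self, query: str) -> str:
        lines = [self.summary] if self.summary else []
        lines += [f"Retrieved: {h} / {a}" for h, a in self.retrieve(query)]
        lines += [f"Human: {h}\nAI: {a}" for h, a in self.recent]
        return "\n".join(lines)


memory = HybridMemory(keep_recent=2)
memory.save_context("I need a refund for order 1182", "Refund for order 1182 has been issued.")
memory.save_context("Please change my shipping address", "Your shipping address is updated.")
memory.save_context("How do I reset my password?", "Use the reset link on the login page.")
memory.save_context("Can I get an invoice copy?", "The invoice was emailed to you.")

print(memory.build_context("What was the order number for my refund?"))
```

The prompt stays small (summary, one retrieved exchange, two recent exchanges), yet the exact order number survives because the archived exchange is fetched verbatim instead of being read from the lossy summary.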

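VERSION ConversationSummaryMemory is a legacy memory class, so check the documentation for your installed version before relying on it. On 0.x releases it imports from langchain.memory; on langchain 1.x, the legacy memory module is expected to live in the separate langchain-classic package (langchain_classic.memory) instead. In either case, use langchain_openai (not the deprecated langchain.chat_models) for the ChatOpenAI import.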

Once you master summarization memory, explore <code>ConversationSummaryBufferMemory</code> next: it combines a fixed-size recent buffer with automatic summarization, giving you the best of both cost and context control.
