Chunking strategies for summarization
Quick answer
Use chunking to split large texts into manageable pieces before summarization with AI models. Common strategies include fixed-size chunks, semantic chunking by paragraphs or sentences, and overlapping chunks that preserve context for better summary quality.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable for secure access.
pip install openai

output

Collecting openai
  Downloading openai-1.0.0-py3-none-any.whl (50 kB)
Installing collected packages: openai
Successfully installed openai-1.0.0
Step by step
This example demonstrates chunking a long text into fixed-size overlapping chunks and summarizing each chunk using the gpt-4o model from OpenAI.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Sample long text
long_text = (
    "Artificial intelligence (AI) is transforming industries by enabling machines to learn from data, "
    "make decisions, and perform tasks that typically require human intelligence. However, large documents "
    "can exceed model input limits, so chunking is essential for effective summarization. "
    "Chunking strategies include fixed-size chunks, semantic chunks by paragraphs or sentences, "
    "and overlapping chunks to maintain context continuity."
)

# Chunking parameters
chunk_size = 100  # characters
overlap = 20      # characters

# Function to create overlapping chunks
def chunk_text(text, size, overlap):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):  # stop here to avoid a redundant trailing chunk
            break
        start += size - overlap
    return chunks

chunks = chunk_text(long_text, chunk_size, overlap)

summaries = []
for i, chunk in enumerate(chunks):
    messages = [
        {"role": "user", "content": f"Summarize this text chunk:\n{chunk}"}
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    summary = response.choices[0].message.content.strip()
    print(f"Chunk {i+1} summary:\n{summary}\n")
    summaries.append(summary)

# Optionally, combine chunk summaries into a final summary
final_prompt = (
    "Summarize the following summaries into a concise overall summary:\n"
    + "\n".join(summaries)
)
final_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": final_prompt}],
)
final_summary = final_response.choices[0].message.content.strip()
print("Final combined summary:\n", final_summary)

output
Chunk 1 summary:
AI transforms industries by enabling machines to learn and make decisions. Chunking helps summarize large documents effectively.

Chunk 2 summary:
Common chunking methods include fixed-size, semantic, and overlapping chunks to preserve context.

Final combined summary:
AI revolutionizes industries by enabling intelligent machines. Effective summarization of large texts requires chunking strategies like fixed-size, semantic, and overlapping chunks to maintain context.
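To see how the overlap behaves at chunk boundaries, you can exercise the chunking helper offline, with no API key, on a toy string (the sizes below are purely illustrative):

```python
def chunk_text(text, size, overlap):
    """Fixed-size chunks where consecutive chunks share `overlap` characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):  # stop to avoid a redundant trailing chunk
            break
        start += size - overlap
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", size=10, overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Note that each chunk repeats the last three characters of its predecessor; that shared span is what carries context across chunk boundaries when each chunk is summarized independently.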
Common variations
- Use semantic chunking by splitting text on paragraphs or sentences for better context preservation.
- Implement asynchronous calls with asyncio for faster processing of multiple chunks.
- Try different models like gpt-4o-mini for cost-effective summarization or claude-3-5-sonnet-20241022 for alternative APIs.
import asyncio
import os

from openai import AsyncOpenAI

# In openai>=1.0 there is no acreate(); async usage goes through the
# AsyncOpenAI client with the regular create() method awaited.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def summarize_chunk(chunk):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this chunk:\n{chunk}"}],
    )
    return response.choices[0].message.content.strip()

async def main():
    chunks = ["First chunk text.", "Second chunk text.", "Third chunk text."]
    # gather() runs all chunk requests concurrently
    summaries = await asyncio.gather(*(summarize_chunk(c) for c in chunks))
    print("Summaries:", summaries)

asyncio.run(main())

output
Summaries: ['Summary of first chunk.', 'Summary of second chunk.', 'Summary of third chunk.']
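The first variation, semantic chunking, can be sketched with the standard library alone: split on sentence boundaries, then pack whole sentences into chunks up to a size budget. The regex below is a simple approximation of sentence splitting, not a full tokenizer, and the budget is illustrative:

```python
import re

def semantic_chunks(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters,
    so no sentence is ever split across a chunk boundary."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks

text = (
    "AI is transforming industries. Large documents can exceed model limits. "
    "Chunking keeps each request within budget. Overlap preserves context."
)
for chunk in semantic_chunks(text, max_chars=80):
    print(repr(chunk))
```

Because boundaries fall between sentences, each chunk is a self-contained unit of meaning, which usually yields better per-chunk summaries than cutting mid-sentence.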
Troubleshooting
- If you hit token limits, reduce chunk size; note that overlap adds tokens across chunks, so increase it cautiously.
- For incomplete summaries, ensure chunks have enough context by using overlap or semantic chunking.
- If API calls fail, verify your OPENAI_API_KEY environment variable is set correctly.
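Token limits are counted in tokens, not characters, so when sizing chunks it helps to derive a character budget from the model's context window. The ~4 characters per token ratio below is a rough rule of thumb for English text (use a real tokenizer such as tiktoken for exact counts), and the budget figures are illustrative assumptions:

```python
def chars_for_token_budget(max_tokens, chars_per_token=4):
    """Rough character budget for a token limit, assuming ~4 chars/token
    for typical English text (an approximation, not an exact count)."""
    return max_tokens * chars_per_token

# Leave headroom for the prompt wrapper and the model's reply.
model_context = 8192        # hypothetical context window
reserved_for_reply = 1024   # tokens kept free for the summary
prompt_overhead = 64        # tokens for the instruction text
chunk_tokens = model_context - reserved_for_reply - prompt_overhead
print(chars_for_token_budget(chunk_tokens))  # → 28416
```

Sizing chunks from the token budget up front avoids trial-and-error against rate-limit and context-length errors.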
Key Takeaways
- Chunk large texts into overlapping or semantic chunks to fit model input limits and preserve context.
- Use asynchronous API calls to speed up summarization of multiple chunks.
- Combine individual chunk summaries for a coherent overall summary.
- Adjust chunk size and overlap based on token limits and context needs.
- Always secure your API key via environment variables to avoid leaks.