How to summarize long documents with an LLM
Quick answer
Use a chunking strategy to split a long document into manageable parts, then summarize each chunk with gpt-4o or a similar model via the OpenAI SDK. Finally, combine the chunk summaries into a concise overall summary with one more LLM call.
Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- pip install openai>=1.0
Setup
Install the openai Python package and set your OpenAI API key as an environment variable.
- Run pip install openai to install the SDK.
- Set your API key in your shell: export OPENAI_API_KEY='your_api_key_here' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key_here" (Windows).
pip install openai output:
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x
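Before making any API calls, it can help to confirm the key is actually visible to Python. The helper below is a minimal sketch (the function name is ours, not part of the SDK):

```python
import os

def api_key_configured() -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("OPENAI_API_KEY"))

print("API key configured:", api_key_configured())
```

If this prints False, the examples below will fail at client construction with an authentication error.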
Step by step
This example demonstrates chunking a long text, summarizing each chunk with gpt-4o, then combining those summaries into a final summary.
```python
import os
from openai import OpenAI

# Initialize client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example long document (replace with your own text)
long_text = """\
Artificial intelligence (AI) is transforming industries by enabling machines to perform tasks that typically require human intelligence. This includes natural language processing, computer vision, and decision-making. However, processing very long documents can exceed model context limits, so chunking is essential. By splitting the document into smaller parts, we can summarize each part individually and then synthesize a final summary.
"""

# Split text into chunks; counts words as a rough proxy for tokens
def chunk_text(text, max_tokens=500):
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_chunk.append(word)
        current_length += 1
        if current_length >= max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

chunks = chunk_text(long_text, max_tokens=50)  # small chunks for demo

# Summarize each chunk
chunk_summaries = []
for i, chunk in enumerate(chunks):
    messages = [
        {"role": "user", "content": f"Summarize this text:\n{chunk}"}
    ]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    summary = response.choices[0].message.content.strip()
    chunk_summaries.append(summary)
    print(f"Chunk {i+1} summary:\n{summary}\n")

# Combine chunk summaries into a final summary
combined_text = "\n".join(chunk_summaries)
final_messages = [
    {"role": "user", "content": f"Summarize the following summaries into a concise overall summary:\n{combined_text}"}
]
final_response = client.chat.completions.create(
    model="gpt-4o",
    messages=final_messages
)
final_summary = final_response.choices[0].message.content.strip()
print("Final summary:\n", final_summary)
```

output:
Chunk 1 summary:
Artificial intelligence enables machines to perform tasks requiring human intelligence, such as language processing and decision-making.

Final summary:
Artificial intelligence transforms industries by enabling machines to perform complex tasks like natural language processing and decision-making, with chunking helping to summarize long documents effectively.
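The chunker above starts each chunk exactly where the previous one ends, so a sentence can be cut in half with no shared context between chunks. A common variant (a sketch we are adding, not part of the original example) overlaps consecutive chunks by a fixed number of words so each summary sees a little of its neighbor's text:

```python
def chunk_text_with_overlap(text, max_words=500, overlap=50):
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already covers the tail of the text
    return chunks
```

Larger overlaps improve coherence at chunk boundaries at the cost of more tokens (and more API spend) per document.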
Common variations
You can adapt this approach by:
- Using async calls with asyncio and await for faster parallel chunk summarization.
- Streaming partial summaries with stream=True for real-time output.
- Using other models like gpt-4o-mini for cost savings, or claude-3-5-sonnet-20241022 via the Anthropic SDK.
- Adjusting chunk size based on the token limits of your chosen model.
```python
import asyncio
import os
from openai import AsyncOpenAI

# The async client exposes the same API as OpenAI, but its methods are awaitable
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def summarize_chunk(chunk):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this text:\n{chunk}"}]
    )
    return response.choices[0].message.content.strip()

async def main():
    chunks = ["Text chunk 1", "Text chunk 2"]
    # Summarize all chunks concurrently
    summaries = await asyncio.gather(*(summarize_chunk(c) for c in chunks))
    combined = "\n".join(summaries)
    final = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize these summaries:\n{combined}"}]
    )
    print("Final summary:", final.choices[0].message.content.strip())

asyncio.run(main())
```

output:
Final summary: [Concise summary of the combined chunks]
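For the stream=True variation, the response arrives as a sequence of chunks carrying content deltas. A small accumulator helper (our sketch; the client call is shown in a comment because it needs a live API key) turns the stream into a full string while you print tokens as they arrive:

```python
def collect_stream(stream):
    """Accumulate the content deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk's delta content is None
            print(delta, end="", flush=True)  # show partial output in real time
            parts.append(delta)
    return "".join(parts)

# Usage with a live client (requires a valid API key):
# stream = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "Summarize this text:\n..."}],
#     stream=True,
# )
# summary = collect_stream(stream)
```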
Troubleshooting
- If you get context length exceeded errors, reduce the chunk size or switch to a model with a larger context window.
- If summaries are too generic, provide more detailed instructions in the prompt.
- For rate limit errors, add retry logic with exponential backoff.
- Ensure your API key is correctly set in os.environ["OPENAI_API_KEY"].
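The retry advice above can be sketched as a small generic wrapper (the helper name is ours). In the OpenAI SDK, rate-limited requests raise openai.RateLimitError, which you would pass as retry_on:

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Retry `call` with exponential backoff: base_delay, 2x, 4x, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))

# Example wiring (assumes a configured client):
# summary = with_retries(
#     lambda: client.chat.completions.create(model="gpt-4o", messages=messages),
#     retry_on=(openai.RateLimitError,),
# )
```

Adding random jitter to the delay is a common refinement when many workers retry at once.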
Key takeaways
- Split long documents into chunks within model token limits before summarizing.
- Summarize each chunk individually, then combine summaries for a final concise output.
- Use async calls or streaming for efficiency and responsiveness.
- Adjust chunk size and model choice based on your use case and cost constraints.
- Handle API errors by checking context limits, rate limits, and prompt clarity.