Intermediate · 3 min read

How to optimize context length

Quick answer
To optimize context length, manage input tokens by chunking large texts into smaller segments and use sliding windows to maintain relevant context. Prioritize essential information and truncate or summarize less important parts to fit within the model's token limit.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell doesn't treat >= as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable to interact with the OpenAI API.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example demonstrates how to split a long text into chunks that fit within the gpt-4o model's context window, then send each chunk sequentially while preserving context using a sliding window approach.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example long text
long_text = """\
Your very long document or conversation text goes here. It might exceed the model's max token limit, so we need to chunk it.
"""

# Simple chunking by character count (a rough proxy for token count)
chunk_size = 2000  # characters; size this against the model's context window (128k tokens for gpt-4o)
chunks = [long_text[i:i+chunk_size] for i in range(0, len(long_text), chunk_size)]

# Sliding window context: keep last chunk's end as context for next
context = ""
for i, chunk in enumerate(chunks):
    prompt = context + chunk
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"Chunk {i+1} response:", response.choices[0].message.content)
    # Update context with last 500 chars of current chunk for next iteration
    context = chunk[-500:]
output
Chunk 1 response: [Model output for chunk 1]
Chunk 2 response: [Model output for chunk 2]
...
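The loop above carries only the tail of the previous chunk forward as context. An alternative is to build the overlap into the chunks themselves, so each chunk begins with the end of the one before it. A minimal sketch in plain Python (character-based, like the example above; the `overlap` parameter is illustrative, not part of any API):

```python
def chunk_with_overlap(text, chunk_size=2000, overlap=500):
    """Split text into chunks where each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("x" * 5000, chunk_size=2000, overlap=500)
print(len(chunks))     # → 4
print(len(chunks[0]))  # → 2000
```

With overlap baked in, each request sees the boundary text twice, which costs extra tokens but avoids managing a separate `context` variable.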

Common variations

You can optimize context length by:

  • Using gpt-4o-mini for lower cost and faster responses (it shares gpt-4o's 128k-token context window).
  • Making asynchronous calls with asyncio so chunk requests don't block other work (the sliding window still forces chunks to run in order, so asyncio alone won't parallelize them).
  • Summarizing earlier chunks to reduce token usage before feeding subsequent chunks.
python
import asyncio
import os
from openai import AsyncOpenAI

# The async client exposes the same interface as OpenAI,
# but its methods are awaitable.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def process_chunk(chunk, context):
    prompt = context + chunk
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    long_text = "Your very long document or conversation text goes here."
    chunk_size = 1000
    chunks = [long_text[i:i+chunk_size] for i in range(0, len(long_text), chunk_size)]
    context = ""
    for i, chunk in enumerate(chunks):
        output = await process_chunk(chunk, context)
        print(f"Chunk {i+1} response:", output)
        # Carry the tail of the current chunk forward as context.
        context = chunk[-300:]

asyncio.run(main())
output
Chunk 1 response: [Async model output for chunk 1]
Chunk 2 response: [Async model output for chunk 2]
...
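The third variation, summarizing earlier chunks, replaces the raw-text tail with a compact rolling summary. The sketch below keeps the orchestration separate from the model call: `summarize` is a placeholder (here simple truncation) where you would instead ask the model to condense the text, and the function names are illustrative, not a library API. A stub `ask_model` keeps the demo runnable without an API key.

```python
def summarize(text, max_chars=200):
    # Placeholder: in practice, ask the model to condense `text`,
    # e.g. with a "Summarize the following:" chat completion call.
    return text[:max_chars]

def process_with_rolling_summary(chunks, ask_model):
    """Feed chunks sequentially, prefixing each with a summary of
    everything seen so far instead of a raw-text tail."""
    summary = ""
    outputs = []
    for chunk in chunks:
        prompt = f"Context so far: {summary}\n\n{chunk}"
        outputs.append(ask_model(prompt))
        # Fold the new chunk into the running summary.
        summary = summarize(summary + " " + chunk)
    return outputs

# Demo with a stub model so the sketch runs offline.
chunks = ["First part of the document.", "Second part of the document."]
outputs = process_with_rolling_summary(chunks, ask_model=lambda p: p[:20])
print(outputs)
```

Because the summary stays a roughly constant size, prompt length no longer grows with the tail of each chunk, at the cost of one extra summarization call per chunk when you plug in a real model.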

Troubleshooting

If you hit `context_length_exceeded` errors, reduce your chunk size or summarize earlier context. If responses seem disconnected, increase the overlap in your sliding window or maintain a summary of prior chunks to preserve continuity.
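When picking a chunk size up front, a rough rule of thumb is ~4 characters per token for English text; this ratio is only a heuristic (use a tokenizer such as tiktoken for exact counts). A quick budget check under that assumption:

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text

def max_chunk_chars(context_limit_tokens, reserved_tokens):
    """Characters available for a chunk after reserving tokens for
    the sliding-window context and the model's reply."""
    available = context_limit_tokens - reserved_tokens
    if available <= 0:
        raise ValueError("reserved tokens exceed the context limit")
    return available * CHARS_PER_TOKEN

# e.g. a 128k-token window, reserving 4k tokens for context + reply
print(max_chunk_chars(128_000, 4_000))  # → 496000
```

In practice you would reserve more than the minimum, since tokenizer ratios vary by language and content (code and non-English text often run well under 4 characters per token).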

Key Takeaways

  • Chunk large inputs to fit within the model's token limit to avoid truncation errors.
  • Use sliding windows or summaries to maintain relevant context across chunks.
  • Adjust chunk size based on the model's max context length and your application's latency requirements.
Verified 2026-04 · gpt-4o, gpt-4o-mini