Intermediate · 3 min read

How to optimize context length

Quick answer
To optimize context length, manage input tokens by chunking large texts into smaller segments and use sliding windows to maintain relevant context. Prioritize essential information and truncate or summarize less important parts to fit within the model's token limit.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell doesn't treat >= as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable to interact with the OpenAI API.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl
Installing collected packages: openai
Successfully installed openai-1.x.x

Step by step

This example demonstrates how to split a long text into chunks that fit within the gpt-4o model's context window, then send each chunk sequentially while preserving context using a sliding window approach.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example long text
long_text = """\
Your very long document or conversation text goes here. It might exceed the model's max token limit, so we need to chunk it.
"""

# Simple chunking by character count (a rough proxy for token count)
chunk_size = 2000  # characters; size this against the model's context window (128k tokens for gpt-4o)
chunks = [long_text[i:i+chunk_size] for i in range(0, len(long_text), chunk_size)]

# Sliding window context: keep last chunk's end as context for next
context = ""
for i, chunk in enumerate(chunks):
    prompt = context + chunk
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"Chunk {i+1} response:", response.choices[0].message.content)
    # Update context with last 500 chars of current chunk for next iteration
    context = chunk[-500:]
output
Chunk 1 response: [Model output for chunk 1]
Chunk 2 response: [Model output for chunk 2]
...
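The loop above carries only the tail of the previous chunk forward as context. An alternative is to build the overlap into the chunks themselves, so each chunk begins with the end of the one before it. A minimal sketch in plain Python (character-based, like the example above; the `overlap` parameter is illustrative, not part of any API):

```python
def chunk_with_overlap(text, chunk_size=2000, overlap=500):
    """Split text into chunks where each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("x" * 5000, chunk_size=2000, overlap=500)
print(len(chunks))     # → 4
print(len(chunks[0]))  # → 2000
```

With overlap baked in, each request sees the boundary text twice, which costs extra tokens but avoids managing a separate `context` variable.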

Common variations

You can optimize context length by:

  • Using gpt-4o-mini for lower cost and faster responses (it shares gpt-4o's 128k-token context window).
  • Making asynchronous calls with asyncio so chunk requests don't block other work (the sliding window still forces chunks to run in order, so asyncio alone won't parallelize them).
  • Summarizing earlier chunks to reduce token usage before feeding subsequent chunks.
python
import asyncio
import os
from openai import AsyncOpenAI

# The async client exposes the same interface as OpenAI,
# but its methods are awaitable.
client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def process_chunk(chunk, context):
    prompt = context + chunk
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    long_text = "Your very long document or conversation text goes here."
    chunk_size = 1000
    chunks = [long_text[i:i+chunk_size] for i in range(0, len(long_text), chunk_size)]
    context = ""
    for i, chunk in enumerate(chunks):
        output = await process_chunk(chunk, context)
        print(f"Chunk {i+1} response:", output)
        # Carry the tail of the current chunk forward as context.
        context = chunk[-300:]

asyncio.run(main())
output
Chunk 1 response: [Async model output for chunk 1]
Chunk 2 response: [Async model output for chunk 2]
...
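The third variation, summarizing earlier chunks, replaces the raw-text tail with a compact rolling summary. The sketch below keeps the orchestration separate from the model call: `summarize` is a placeholder (here simple truncation) where you would instead ask the model to condense the text, and the function names are illustrative, not a library API. A stub `ask_model` keeps the demo runnable without an API key.

```python
def summarize(text, max_chars=200):
    # Placeholder: in practice, ask the model to condense `text`,
    # e.g. with a "Summarize the following:" chat completion call.
    return text[:max_chars]

def process_with_rolling_summary(chunks, ask_model):
    """Feed chunks sequentially, prefixing each with a summary of
    everything seen so far instead of a raw-text tail."""
    summary = ""
    outputs = []
    for chunk in chunks:
        prompt = f"Context so far: {summary}\n\n{chunk}"
        outputs.append(ask_model(prompt))
        # Fold the new chunk into the running summary.
        summary = summarize(summary + " " + chunk)
    return outputs

# Demo with a stub model so the sketch runs offline.
chunks = ["First part of the document.", "Second part of the document."]
outputs = process_with_rolling_summary(chunks, ask_model=lambda p: p[:20])
print(outputs)
```

Because the summary stays a roughly constant size, prompt length no longer grows with the tail of each chunk, at the cost of one extra summarization call per chunk when you plug in a real model.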

Troubleshooting

If you hit `context_length_exceeded` errors, reduce your chunk size or summarize earlier context. If responses seem disconnected, increase the overlap in your sliding window or maintain a summary of prior chunks to preserve continuity.
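When picking a chunk size up front, a rough rule of thumb is ~4 characters per token for English text; this ratio is only a heuristic (use a tokenizer such as tiktoken for exact counts). A quick budget check under that assumption:

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text

def max_chunk_chars(context_limit_tokens, reserved_tokens):
    """Characters available for a chunk after reserving tokens for
    the sliding-window context and the model's reply."""
    available = context_limit_tokens - reserved_tokens
    if available <= 0:
        raise ValueError("reserved tokens exceed the context limit")
    return available * CHARS_PER_TOKEN

# e.g. a 128k-token window, reserving 4k tokens for context + reply
print(max_chunk_chars(128_000, 4_000))  # → 496000
```

In practice you would reserve more than the minimum, since tokenizer ratios vary by language and content (code and non-English text often run well under 4 characters per token).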

Key Takeaways

  • Chunk large inputs to fit within the model's token limit to avoid truncation errors.
  • Use sliding windows or summaries to maintain relevant context across chunks.
  • Adjust chunk size based on the model's max context length and your application's latency requirements.
Verified 2026-04 · gpt-4o, gpt-4o-mini