How-to · Beginner · 3 min read

How context length affects latency

Quick answer
Increasing the context length of a large language model request increases latency because the model must process more tokens per request. Total latency grows roughly linearly with the number of tokens processed (input plus output), so longer contexts slow down responses. Output tokens are the bigger cost: each one is generated sequentially, while input tokens are processed largely in parallel during the prefill phase.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable for secure access.

bash
pip install "openai>=1.0"
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
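The client reads the key from the OPENAI_API_KEY environment variable. One way to set it on macOS/Linux (on Windows, use setx or $env:OPENAI_API_KEY in PowerShell; the key value below is a placeholder):

```shell
# Replace with your real key; add this line to your shell profile to persist it
export OPENAI_API_KEY="sk-your-key-here"
```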

Step by step

This example measures latency for two different context lengths using gpt-4o-mini. It sends a short prompt and a long prompt, then prints the time taken for each completion.

python
import os
import time
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

short_prompt = "Explain AI in simple terms."
long_prompt = "Explain AI in simple terms." + " More details." * 200  # longer context

for prompt in [short_prompt, long_prompt]:
    start = time.time()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    end = time.time()
    print(f"Prompt length: {len(prompt.split())} words (rough proxy for tokens), Latency: {end - start:.2f} seconds")
    print("Response snippet:", response.choices[0].message.content[:60], "...\n")
output
Prompt length: 5 words (rough proxy for tokens), Latency: 1.20 seconds
Response snippet: AI is the simulation of human intelligence in machines...

Prompt length: 405 words (rough proxy for tokens), Latency: 4.85 seconds
Response snippet: AI is the simulation of human intelligence in machines. More details. More details. More ...
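Word count is only a rough proxy for tokens. A common rule of thumb for English text is about 4 characters per token; a minimal estimator based on that heuristic (for exact counts, read the `usage` field on the API response):

```python
def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for typical English text.
    # For exact counts, inspect response.usage on the API response.
    return max(1, len(text) // 4)

print(estimate_tokens("Explain AI in simple terms."))  # 6
```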

Common variations

You can measure latency with the async client, or use streaming to start receiving tokens sooner. Streaming does not shorten total generation time, but it improves perceived responsiveness by cutting the time to first token. Different models also have different maximum context lengths and performance characteristics.

python
import asyncio
import os
import time
from openai import AsyncOpenAI  # the async client; awaiting the sync OpenAI client raises an error

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def measure_latency_async(prompt):
    start = time.monotonic()
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    print(f"Async latency: {time.monotonic() - start:.2f} seconds")

asyncio.run(measure_latency_async("Explain AI."))
output
Async latency: 1.15 seconds
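To measure the streaming benefit directly, time how long the first chunk takes to arrive. A minimal sketch: the helper below works with any iterable, so you can pass it the stream returned by `stream=True` (the commented-out call assumes the same client setup as above):

```python
import time

def time_to_first_chunk(chunks):
    # Return seconds elapsed until the first item arrives from an iterable,
    # or None if the iterable is empty.
    start = time.monotonic()
    for _ in chunks:
        return time.monotonic() - start
    return None

# With the real API, pass the stream directly (assumes the client from the setup above):
# stream = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "Explain AI."}],
#     max_tokens=100,
#     stream=True,
# )
# print(f"Time to first token: {time_to_first_chunk(stream):.2f} seconds")
```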

Troubleshooting

  • If latency is unexpectedly high, check your network connection and API rate limits.
  • Very long contexts may exceed model limits, causing errors; split inputs if needed.
  • Use smaller models or reduce max_tokens to lower latency.
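For the "split inputs" tip, here is a minimal word-based splitter (a hypothetical helper, not part of the openai library; word counts are only a rough proxy for token limits):

```python
def split_into_chunks(text, max_words=500):
    # Split text into pieces of at most max_words words each (hypothetical helper).
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = split_into_chunks("word " * 1200)
print(len(chunks))  # 3 chunks: 500 + 500 + 200 words
```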

Key Takeaways

  • Latency increases roughly linearly with total tokens processed (input + output).
  • Longer context windows require more computation, slowing response times.
  • Streaming output can improve perceived latency by delivering tokens incrementally.
  • Choose model and context length based on latency and task requirements.
Verified 2026-04 · gpt-4o-mini