LLM streaming tokens explained
Quick answer
Use `stream=True` in your LLM API call to receive tokens incrementally as they are generated, enabling real-time output. Each streamed chunk contains partial tokens, allowing your application to process or display them immediately without waiting for the full response.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- `pip install "openai>=1.0"`
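The latency benefit described above can be seen even without an API call: a Python generator makes the first token available before the rest are produced. A minimal stand-in sketch (no network, illustrative delays only):

```python
import time

def generate_streamed(tokens):
    # streaming: each token becomes available as soon as it is produced
    for token in tokens:
        time.sleep(0.01)  # stand-in for per-token generation time
        yield token

tokens = ["Hello", ", ", "world", "!"]
stream = generate_streamed(tokens)
first = next(stream)  # first token arrives after ~0.01s,
print(first)          # not after the whole response is done
```

A non-streaming call, by contrast, returns nothing until every token has been generated.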
Setup
Install the official openai Python package (v1+) and set your API key as an environment variable for secure authentication.
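One way to set the environment variable in a POSIX shell (the key value below is a placeholder, not a real key):

```shell
export OPENAI_API_KEY="sk-your-key-here"  # placeholder; substitute your real key
```

On Windows, set it via the system environment settings or `setx OPENAI_API_KEY ...` instead.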
```shell
pip install "openai>=1.0"
```

Output:

```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example demonstrates streaming tokens from the gpt-4o model using the OpenAI SDK. The stream=True parameter enables incremental token delivery. The code prints tokens as they arrive.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Explain streaming tokens in LLMs."}]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)

print("Streaming response:")
for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        print(token, end='', flush=True)
print()
```

Output:

```
Streaming response:
Streaming tokens allow your app to receive partial outputs from the model in real-time, enabling faster and more interactive user experiences.
```
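Beyond printing, a common pattern is to accumulate the streamed tokens into the full response string. The sketch below mocks the chunk shape used above (`chunk.choices[0].delta.content`) with `SimpleNamespace` so it runs without an API call; a real stream yields objects with the same shape.

```python
from types import SimpleNamespace

def make_chunk(content):
    # mock a streamed chunk: real chunks expose chunk.choices[0].delta.content
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

# some chunks carry only metadata, so delta.content can be None
fake_stream = [make_chunk("Stream"), make_chunk("ing "),
               make_chunk(None), make_chunk("tokens")]

parts = []
for chunk in fake_stream:
    token = chunk.choices[0].delta.content
    if token:  # skip metadata-only chunks
        parts.append(token)
full_response = "".join(parts)
print(full_response)  # Streaming tokens
```

Keeping the accumulated string lets you display tokens live and still log or post-process the complete answer afterwards.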
Common variations
- Use `async` with `async for` to stream tokens asynchronously.
- Switch models by changing the `model` parameter, e.g., `gpt-4o-mini` or `claude-3-5-sonnet-20241022`.
- For Anthropic Claude, use the `anthropic` SDK with `stream=True` in `client.messages.create`.
```python
import asyncio
import os
from openai import AsyncOpenAI  # the async client supports await and async for

async def async_stream():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Explain streaming tokens asynchronously."}]
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    print("Async streaming response:")
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            print(token, end='', flush=True)
    print()

asyncio.run(async_stream())
```

Output:

```
Async streaming response:
Streaming tokens asynchronously lets your app handle tokens as soon as they arrive, improving responsiveness.
```
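The `async for` consumption pattern can be exercised without an API call using a stand-in async generator (illustrative only; token strings are made up):

```python
import asyncio

async def fake_stream():
    # stand-in async generator mimicking incremental token delivery
    for token in ["Async ", "streaming ", "works"]:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield token

async def consume():
    parts = []
    async for token in fake_stream():
        parts.append(token)  # handle each token as soon as it arrives
    return "".join(parts)

result = asyncio.run(consume())
print(result)  # Async streaming works
```

Because the loop awaits between tokens, the event loop stays free to run other tasks while the response streams in.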
Troubleshooting
- If streaming hangs or returns no tokens, verify your API key and network connectivity.
- Ensure `stream=True` is set; otherwise, the API returns the full response at once.
- Handle `None` tokens gracefully, as some chunks may only contain metadata.
Key Takeaways
- Set `stream=True` to receive tokens incrementally from LLMs for real-time output.
- Process each token chunk as it arrives to improve user experience and reduce latency.
- Use async streaming for non-blocking token handling in asynchronous applications.