LLM streaming tokens explained
Quick answer
Use `stream=True` in your LLM API call to receive tokens incrementally as they are generated, enabling real-time output. Each streamed chunk contains partial tokens, allowing your application to process or display them immediately without waiting for the full response.

Prerequisites
- Python 3.8+
- OpenAI API key (free tier works)
- `pip install "openai>=1.0"`
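The latency benefit described above can be seen even without an API call: a Python generator makes the first token available before the rest are produced. A minimal stand-in sketch (no network, illustrative delays only):

```python
import time

def generate_streamed(tokens):
    # streaming: each token becomes available as soon as it is produced
    for token in tokens:
        time.sleep(0.01)  # stand-in for per-token generation time
        yield token

tokens = ["Hello", ", ", "world", "!"]
stream = generate_streamed(tokens)
first = next(stream)  # first token arrives after ~0.01s,
print(first)          # not after the whole response is done
```

A non-streaming call, by contrast, returns nothing until every token has been generated.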
Setup
Install the official openai Python package (v1+) and set your API key as an environment variable for secure authentication.
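One way to set the environment variable in a POSIX shell (the key value below is a placeholder, not a real key):

```shell
export OPENAI_API_KEY="sk-your-key-here"  # placeholder; substitute your real key
```

On Windows, set it via the system environment settings or `setx OPENAI_API_KEY ...` instead.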
```shell
pip install "openai>=1.0"
```

Output:

```
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
```
Step by step
This example demonstrates streaming tokens from the gpt-4o model using the OpenAI SDK. The stream=True parameter enables incremental token delivery. The code prints tokens as they arrive.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Explain streaming tokens in LLMs."}]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)

print("Streaming response:")
for chunk in stream:
    token = chunk.choices[0].delta.content
    if token:
        print(token, end='', flush=True)
print()
```

Output:

```
Streaming response:
Streaming tokens allow your app to receive partial outputs from the model in real-time, enabling faster and more interactive user experiences.
```
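Beyond printing, a common pattern is to accumulate the streamed tokens into the full response string. The sketch below mocks the chunk shape used above (`chunk.choices[0].delta.content`) with `SimpleNamespace` so it runs without an API call; a real stream yields objects with the same shape.

```python
from types import SimpleNamespace

def make_chunk(content):
    # mock a streamed chunk: real chunks expose chunk.choices[0].delta.content
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

# some chunks carry only metadata, so delta.content can be None
fake_stream = [make_chunk("Stream"), make_chunk("ing "),
               make_chunk(None), make_chunk("tokens")]

parts = []
for chunk in fake_stream:
    token = chunk.choices[0].delta.content
    if token:  # skip metadata-only chunks
        parts.append(token)
full_response = "".join(parts)
print(full_response)  # Streaming tokens
```

Keeping the accumulated string lets you display tokens live and still log or post-process the complete answer afterwards.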
Common variations
- Use `async` with `async for` to stream tokens asynchronously.
- Switch models by changing the `model` parameter, e.g., `gpt-4o-mini` or `claude-3-5-sonnet-20241022`.
- For Anthropic Claude, use the `anthropic` SDK with `stream=True` in `client.messages.create`.
```python
import asyncio
import os
from openai import AsyncOpenAI  # the async client supports await and async for

async def async_stream():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Explain streaming tokens asynchronously."}]
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    print("Async streaming response:")
    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token:
            print(token, end='', flush=True)
    print()

asyncio.run(async_stream())
```

Output:

```
Async streaming response:
Streaming tokens asynchronously lets your app handle tokens as soon as they arrive, improving responsiveness.
```
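The `async for` consumption pattern can be exercised without an API call using a stand-in async generator (illustrative only; token strings are made up):

```python
import asyncio

async def fake_stream():
    # stand-in async generator mimicking incremental token delivery
    for token in ["Async ", "streaming ", "works"]:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield token

async def consume():
    parts = []
    async for token in fake_stream():
        parts.append(token)  # handle each token as soon as it arrives
    return "".join(parts)

result = asyncio.run(consume())
print(result)  # Async streaming works
```

Because the loop awaits between tokens, the event loop stays free to run other tasks while the response streams in.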
Troubleshooting
- If streaming hangs or returns no tokens, verify your API key and network connectivity.
- Ensure `stream=True` is set; otherwise, the API returns the full response at once.
- Handle `None` tokens gracefully, as some chunks may only contain metadata.
Key Takeaways
- Set `stream=True` to receive tokens incrementally from LLMs for real-time output.
- Process each token chunk as it arrives to improve user experience and reduce latency.
- Use async streaming for non-blocking token handling in asynchronous applications.