How to stream OpenAI responses in Python
Direct answer
Use the stream=True parameter with client.chat.completions.create() in the OpenAI Python SDK to receive responses token by token.
Setup
Install
pip install openai
Env vars
OPENAI_API_KEY Imports
import os
from openai import OpenAI
Examples
In: User message: 'Write a short poem about spring.'
Out: Streaming tokens as they arrive, printing the poem line by line.
In: User message: 'Explain quantum computing in simple terms.'
Out: Streaming explanation tokens in real time for immediate display.
In: User message: '' (empty input)
Out: Streaming minimal or no tokens, handling empty input gracefully.
Integration steps
- Import the OpenAI client and initialize it with the API key from os.environ.
- Create a messages list with the user prompt.
- Call client.chat.completions.create() with stream=True and the model name.
- Iterate over the streaming response to receive tokens as they arrive.
- Concatenate or process tokens in real-time for display or further processing.
Full code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Write a short poem about spring."}]
response_stream = client.chat.completions.create(
model="gpt-4o",
messages=messages,
stream=True
)
print("Streaming response:")
collected_text = ""
for chunk in response_stream:
    # Each chunk is a ChatCompletionChunk object; delta.content may be None
    delta = chunk.choices[0].delta
    if delta.content is not None:
        token = delta.content
        print(token, end="", flush=True)
        collected_text += token
print()
# collected_text now contains the full response output
Streaming response: Spring whispers softly, Blossoms dance in warm sunlight, New life awakens.
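The accumulation loop above can be factored into a reusable helper. `collect_stream` and the Fake* stub classes below are illustrative names (not part of the SDK), used so the extraction logic can run without an API call:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass
class FakeDelta:
    content: Optional[str] = None

@dataclass
class FakeChoice:
    delta: FakeDelta

@dataclass
class FakeChunk:
    choices: List[FakeChoice]

def collect_stream(stream: Iterable) -> str:
    """Concatenate delta.content across chunks, skipping None deltas."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            parts.append(content)
    return "".join(parts)

# Stub chunks shaped like chat.completion.chunk objects
chunks = [
    FakeChunk([FakeChoice(FakeDelta("Hel"))]),
    FakeChunk([FakeChoice(FakeDelta("lo"))]),
    FakeChunk([FakeChoice(FakeDelta(None))]),  # final chunk often carries no content
]
print(collect_stream(chunks))  # → Hello
```

The same helper works unchanged on a real response stream, since it only touches the `choices[0].delta.content` path shown in the trace above.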
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Write a short poem about spring."}], "stream": true}
Response
{"choices": [{"delta": {"content": "token text"}, "index": 0, "finish_reason": null}], "id": "chatcmpl-xxx", "object": "chat.completion.chunk"}
Extract
Iterate over the response stream and concatenate the chunk.choices[0].delta.content tokens.
Variants
Async streaming with OpenAI Python SDK ›
Use async streaming to handle multiple concurrent streaming requests efficiently.
import os
import asyncio
from openai import AsyncOpenAI
async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Explain AI in simple terms."}]
    response_stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    print("Async streaming response:")
    collected_text = ""
    async for chunk in response_stream:
        delta = chunk.choices[0].delta
        if delta.content is not None:
            print(delta.content, end="", flush=True)
            collected_text += delta.content
    print()
asyncio.run(main())
Non-streaming standard completion ›
Use non-streaming for simpler use cases where you want the full response at once.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a short poem about spring."}]
)
print(response.choices[0].message.content)
Streaming with a smaller model (gpt-4o-mini) ›
Use smaller models for cost-effective streaming with slightly reduced output quality.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response_stream = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "Summarize the latest news."}],
stream=True
)
for chunk in response_stream:
    delta = chunk.choices[0].delta
    if delta.content is not None:
        print(delta.content, end="", flush=True)
print()
Performance
Latency: ~500-800ms initial token delay for gpt-4o streaming
Cost: ~$0.002 per 500 tokens for gpt-4o streaming
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
- Use concise prompts to reduce token usage.
- Stream responses to start processing output immediately.
- Set the max_tokens parameter to cap response length and cost.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Streaming (gpt-4o) | ~500-800ms initial delay | ~$0.002 per 500 tokens | Real-time token display |
| Non-streaming (gpt-4o) | ~800ms total | ~$0.002 per 500 tokens | Simple full response retrieval |
| Streaming (gpt-4o-mini) | ~400-600ms initial delay | ~$0.001 per 500 tokens | Cost-effective streaming |
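As a sketch of the cost controls above: max_tokens caps the completion length, and recent API versions (an assumption here, not something this article verified) accept stream_options={"include_usage": True} to append a final chunk carrying token counts. The network call below only fires when an API key is configured:

```python
import os

# Request arguments: max_tokens bounds output length (and cost);
# stream_options asks for a final usage chunk (assumes a recent API version).
request_kwargs = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Summarize the latest news."}],
    "stream": True,
    "max_tokens": 100,
    "stream_options": {"include_usage": True},
}

if os.environ.get("OPENAI_API_KEY"):  # skip the call when no key is set
    from openai import OpenAI
    client = OpenAI()
    for chunk in client.chat.completions.create(**request_kwargs):
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
        if chunk.usage is not None:  # the final usage chunk has empty choices
            print(f"\n[tokens used: {chunk.usage.total_tokens}]")
```

Capping max_tokens also bounds the worst-case latency of a streamed response, since generation stops once the cap is reached.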
Quick tip
Always set stream=True and iterate over the response to get tokens as they arrive for real-time UX.
Common mistake
Beginners often forget to iterate over the streaming generator, causing no output until the stream ends.
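This pitfall is ordinary Python generator behavior: creating the stream does no work until you iterate it. A plain-generator analogy, with no API call involved:

```python
def fake_stream():
    """Stands in for the SDK's streaming response: a lazy generator."""
    for token in ["Streaming ", "needs ", "iteration."]:
        yield token

stream = fake_stream()   # nothing has been produced yet
text = "".join(stream)   # iterating is what actually pulls tokens
print(text)  # → Streaming needs iteration.
```

The same holds for the real streaming response: the `for chunk in response_stream:` loop is what drives the HTTP stream, so dropping it silently discards the output.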