Code beginner · 3 min read

How to stream OpenAI API responses in Python

Direct answer
Call the OpenAI Python SDK's chat.completions.create method with stream=True and iterate over the returned stream, synchronously (OpenAI client) or asynchronously (AsyncOpenAI client), to receive tokens as they arrive.

Setup

Install
bash
pip install openai
Env vars
OPENAI_API_KEY
Imports
python
import os
from openai import OpenAI

Examples

in: Hello, how are you?
out: Hello! I'm doing great, thanks for asking.
in: Explain quantum computing in simple terms.
out: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
in: Write a short poem about spring.
out: Spring blooms anew, with colors bright, Soft breezes dance in morning light.

Integration steps

  1. Import the OpenAI SDK and set up the client with your API key from os.environ.
  2. Prepare the chat messages array with roles and content.
  3. Call chat.completions.create with stream=True to enable streaming.
  4. Iterate over the streaming response to receive partial tokens as they arrive.
  5. Concatenate or process tokens in real-time for immediate output or UI updates.
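Steps 4-5 can be sketched with plain Python, using a list of text deltas as a stand-in for the SDK's stream (the stream_to_callback helper and the sample deltas are illustrative, not part of the SDK):

```python
# Sketch of steps 4-5: drive a per-token callback (e.g. a UI update) from
# streamed deltas while also accumulating the full reply.
def stream_to_callback(deltas, on_token):
    parts = []
    for delta in deltas:
        if delta:              # skip None deltas (the stream's final chunk)
            on_token(delta)    # immediate output per token
            parts.append(delta)
    return "".join(parts)

received = []
result = stream_to_callback(["Spring ", "blooms", None, "."], received.append)
print(result)  # → Spring blooms.
```

With the real SDK, the deltas list is replaced by the chunk loop shown in the full code below.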

Full code

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {"role": "user", "content": "Explain the benefits of streaming OpenAI responses in Python."}
]

print("Streaming response:")
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)

for chunk in response_stream:
    # delta.content is None on the final chunk; coalesce to "" before printing
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
output
Streaming response:
Streaming OpenAI API responses in Python allows you to receive tokens as they are generated, reducing latency and improving user experience by displaying partial results immediately.

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain the benefits of streaming OpenAI responses in Python."}], "stream": true}
Response
json
{"choices": [{"delta": {"content": "Streaming OpenAI API responses in Python allows you to receive tokens as they are generated..."}, "index": 0, "finish_reason": null}]}
Extract: chunk.choices[0].delta.content
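If you ever parse chunks yourself (for example from a raw SSE feed), it helps to guard against an empty choices list and a None delta. A minimal sketch, modeling the chunk as a plain dict with the same shape as the JSON above (the SDK returns typed objects instead):

```python
# Defensive extraction of streamed text from a chunk modeled as a plain dict
# mirroring the JSON response shape.
def extract_delta(chunk):
    choices = chunk.get("choices") or []
    if not choices:
        return ""  # e.g. a keep-alive or usage-only chunk
    return (choices[0].get("delta") or {}).get("content") or ""

print(extract_delta({"choices": [{"delta": {"content": "Streaming"}}]}))  # → Streaming
print(repr(extract_delta({"choices": [{"delta": {}}]})))                  # → ''
```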

Variants

Async Streaming

Use async streaming when your application supports asynchronous I/O for better concurrency and responsiveness.

python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # Async streaming requires the AsyncOpenAI client, and the create call
    # must be awaited before iterating with async for.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Tell me a joke."}]
    print("Async streaming response:")
    stream = await client.chat.completions.create(
        model="gpt-4o", messages=messages, stream=True
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(main())
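The concurrency benefit is easiest to see when two streams are consumed at once with asyncio.gather. In this sketch, the fake_stream async generator stands in for an SDK streaming response, so the example runs without an API key:

```python
import asyncio

# Simulated streaming response: yields one token at a time with a small delay.
async def fake_stream(text, delay):
    for token in text.split():
        await asyncio.sleep(delay)
        yield token

# Drain one stream into a single string.
async def collect(stream):
    return " ".join([tok async for tok in stream])

# Consume both streams concurrently; total wall time is roughly one stream's
# duration, not the sum of both.
async def main():
    return await asyncio.gather(
        collect(fake_stream("joke one", 0.01)),
        collect(fake_stream("joke two", 0.01)),
    )

print(asyncio.run(main()))  # → ['joke one', 'joke two']
```

With the real SDK, each fake_stream would be replaced by an awaited client.chat.completions.create(..., stream=True) call.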
Non-Streaming (Standard)

Use non-streaming for simple use cases where you want the full response at once without partial updates.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Explain streaming vs non-streaming."}]
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print("Non-streaming response:", response.choices[0].message.content)
Use a Smaller Model for Faster Streaming

Use a smaller model like gpt-4o-mini to reduce latency and cost when streaming responses.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Summarize the benefits of streaming."}]
response_stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stream=True
)
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

Performance

Latency: ~300-800 ms initial token delay for gpt-4o streaming
Cost: ~$0.002 per 500 tokens for gpt-4o streamed calls
Rate limits: Tier 1: 500 requests per minute, 30,000 tokens per minute
  • Use concise prompts to reduce token usage.
  • Stream responses to start processing tokens immediately.
  • Choose smaller models for cheaper, faster streaming.
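To check latency figures like these in your own environment, you can time the first token. A hedged sketch: time_to_first_token works on any iterable of text deltas, shown here with a simulated slow stream rather than a live API call:

```python
import time

def time_to_first_token(stream):
    """Return (seconds until first non-empty delta, that delta).
    `stream` is any iterable of text deltas; with the SDK, pass the
    streaming response and extract delta.content per chunk instead."""
    start = time.perf_counter()
    for delta in stream:
        if delta:
            return time.perf_counter() - start, delta
    return None, ""

def slow_stream():
    time.sleep(0.05)  # simulate network/model latency before the first token
    yield "Hello"
    yield " world"

ttft, first = time_to_first_token(slow_stream())
print(f"first token {first!r} after {ttft * 1000:.0f} ms")
```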
Approach          | Latency                   | Cost/call              | Best for
Streaming (sync)  | ~300-800 ms initial delay | ~$0.002 per 500 tokens | Real-time UI updates
Streaming (async) | ~300-800 ms initial delay | ~$0.002 per 500 tokens | Concurrent apps with async support
Non-streaming     | ~1-2 s full response      | ~$0.002 per 500 tokens | Simple batch processing

Quick tip

Always set stream=True in chat.completions.create and process chunks incrementally for real-time token output.

Common mistake

Beginners often forget to check for None in chunk.choices[0].delta.content, which raises a TypeError when concatenating the final chunk of a streamed response.
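The pitfall and its fix, in isolation (the deltas list is a stand-in for real chunk contents, whose final value is typically None):

```python
# The last streamed delta is often None; concatenating it directly raises
# TypeError. Coalescing with `or ""` keeps the loop safe.
text = ""
for delta in ["Hel", "lo", None]:
    text += delta or ""  # safe; `text += delta` would crash on the None delta
print(text)  # → Hello
```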

Verified 2026-04 · gpt-4o, gpt-4o-mini