Code beginner · 3 min read

How to stream OpenAI API responses in Python

Direct answer
Call the OpenAI Python SDK's chat.completions.create method with stream=True and iterate over the returned stream, synchronously (OpenAI client) or asynchronously (AsyncOpenAI client), to receive tokens as they arrive.

Setup

Install
bash
pip install openai
Env vars
OPENAI_API_KEY
Imports
python
import os
from openai import OpenAI

Examples

in: Hello, how are you?
out: Hello! I'm doing great, thanks for asking.
in: Explain quantum computing in simple terms.
out: Quantum computing uses quantum bits that can be in multiple states simultaneously, enabling faster problem solving for certain tasks.
in: Write a short poem about spring.
out: Spring blooms anew, with colors bright, Soft breezes dance in morning light.

Integration steps

  1. Import the OpenAI SDK and set up the client with your API key from os.environ.
  2. Prepare the chat messages array with roles and content.
  3. Call chat.completions.create with stream=True to enable streaming.
  4. Iterate over the streaming response to receive partial tokens as they arrive.
  5. Concatenate or process tokens in real-time for immediate output or UI updates.
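Steps 4-5 can be sketched with plain Python, using a list of text deltas as a stand-in for the SDK's stream (the stream_to_callback helper and the sample deltas are illustrative, not part of the SDK):

```python
# Sketch of steps 4-5: drive a per-token callback (e.g. a UI update) from
# streamed deltas while also accumulating the full reply.
def stream_to_callback(deltas, on_token):
    parts = []
    for delta in deltas:
        if delta:              # skip None deltas (the stream's final chunk)
            on_token(delta)    # immediate output per token
            parts.append(delta)
    return "".join(parts)

received = []
result = stream_to_callback(["Spring ", "blooms", None, "."], received.append)
print(result)  # → Spring blooms.
```

With the real SDK, the deltas list is replaced by the chunk loop shown in the full code below.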

Full code

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {"role": "user", "content": "Explain the benefits of streaming OpenAI responses in Python."}
]

print("Streaming response:")
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)

for chunk in response_stream:
    # delta.content is None on the final chunk; coalesce to "" before printing
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
output
Streaming response:
Streaming OpenAI API responses in Python allows you to receive tokens as they are generated, reducing latency and improving user experience by displaying partial results immediately.

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain the benefits of streaming OpenAI responses in Python."}], "stream": true}
Response
json
{"choices": [{"delta": {"content": "Streaming OpenAI API responses in Python allows you to receive tokens as they are generated..."}, "index": 0, "finish_reason": null}]}
Extract: chunk.choices[0].delta.content
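If you ever parse chunks yourself (for example from a raw SSE feed), it helps to guard against an empty choices list and a None delta. A minimal sketch, modeling the chunk as a plain dict with the same shape as the JSON above (the SDK returns typed objects instead):

```python
# Defensive extraction of streamed text from a chunk modeled as a plain dict
# mirroring the JSON response shape.
def extract_delta(chunk):
    choices = chunk.get("choices") or []
    if not choices:
        return ""  # e.g. a keep-alive or usage-only chunk
    return (choices[0].get("delta") or {}).get("content") or ""

print(extract_delta({"choices": [{"delta": {"content": "Streaming"}}]}))  # → Streaming
print(repr(extract_delta({"choices": [{"delta": {}}]})))                  # → ''
```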

Variants

Async Streaming

Use async streaming when your application supports asynchronous I/O for better concurrency and responsiveness.

python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # Async streaming requires the AsyncOpenAI client, and the create call
    # must be awaited before iterating with async for.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Tell me a joke."}]
    print("Async streaming response:")
    stream = await client.chat.completions.create(
        model="gpt-4o", messages=messages, stream=True
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(main())
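The concurrency benefit is easiest to see when two streams are consumed at once with asyncio.gather. In this sketch, the fake_stream async generator stands in for an SDK streaming response, so the example runs without an API key:

```python
import asyncio

# Simulated streaming response: yields one token at a time with a small delay.
async def fake_stream(text, delay):
    for token in text.split():
        await asyncio.sleep(delay)
        yield token

# Drain one stream into a single string.
async def collect(stream):
    return " ".join([tok async for tok in stream])

# Consume both streams concurrently; total wall time is roughly one stream's
# duration, not the sum of both.
async def main():
    return await asyncio.gather(
        collect(fake_stream("joke one", 0.01)),
        collect(fake_stream("joke two", 0.01)),
    )

print(asyncio.run(main()))  # → ['joke one', 'joke two']
```

With the real SDK, each fake_stream would be replaced by an awaited client.chat.completions.create(..., stream=True) call.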
Non-Streaming (Standard)

Use non-streaming for simple use cases where you want the full response at once without partial updates.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Explain streaming vs non-streaming."}]
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print("Non-streaming response:", response.choices[0].message.content)
Use a Smaller Model for Faster Streaming

Use a smaller model like gpt-4o-mini to reduce latency and cost when streaming responses.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Summarize the benefits of streaming."}]
response_stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stream=True
)
for chunk in response_stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

Performance

Latency: ~300-800 ms initial token delay for gpt-4o streaming
Cost: ~$0.002 per 500 tokens for gpt-4o streamed calls
Rate limits: Tier 1: 500 requests per minute, 30,000 tokens per minute
  • Use concise prompts to reduce token usage.
  • Stream responses to start processing tokens immediately.
  • Choose smaller models for cheaper, faster streaming.
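To check latency figures like these in your own environment, you can time the first token. A hedged sketch: time_to_first_token works on any iterable of text deltas, shown here with a simulated slow stream rather than a live API call:

```python
import time

def time_to_first_token(stream):
    """Return (seconds until first non-empty delta, that delta).
    `stream` is any iterable of text deltas; with the SDK, pass the
    streaming response and extract delta.content per chunk instead."""
    start = time.perf_counter()
    for delta in stream:
        if delta:
            return time.perf_counter() - start, delta
    return None, ""

def slow_stream():
    time.sleep(0.05)  # simulate network/model latency before the first token
    yield "Hello"
    yield " world"

ttft, first = time_to_first_token(slow_stream())
print(f"first token {first!r} after {ttft * 1000:.0f} ms")
```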
Approach          | Latency                   | Cost/call              | Best for
Streaming (sync)  | ~300-800 ms initial delay | ~$0.002 per 500 tokens | Real-time UI updates
Streaming (async) | ~300-800 ms initial delay | ~$0.002 per 500 tokens | Concurrent apps with async support
Non-streaming     | ~1-2 s full response      | ~$0.002 per 500 tokens | Simple batch processing

Quick tip

Always set stream=True in chat.completions.create and process chunks incrementally for real-time token output.

Common mistake

Beginners often forget to check for None in chunk.choices[0].delta.content, which raises a TypeError when concatenating the final chunk of a streamed response.
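The pitfall and its fix, in isolation (the deltas list is a stand-in for real chunk contents, whose final value is typically None):

```python
# The last streamed delta is often None; concatenating it directly raises
# TypeError. Coalescing with `or ""` keeps the loop safe.
text = ""
for delta in ["Hel", "lo", None]:
    text += delta or ""  # safe; `text += delta` would crash on the None delta
print(text)  # → Hello
```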

Verified 2026-04 · gpt-4o, gpt-4o-mini