Code beginner · 3 min read

How to stream chatbot responses in Python

Direct answer
Use the OpenAI SDK's chat.completions.create method with stream=True and iterate over the returned stream to receive tokens as they arrive — synchronously with the OpenAI client, or asynchronously with AsyncOpenAI.

Setup

Install
bash
pip install openai
Env vars
OPENAI_API_KEY
Imports
python
import os
from openai import OpenAI

Examples

in: Hello, how are you?
out: Hi! I'm doing great, thanks for asking. How can I assist you today?
in: Explain quantum computing in simple terms.
out: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
in: Tell me a joke about programmers.
out: Why do programmers prefer dark mode? Because light attracts bugs!

Integration steps

  1. Import the OpenAI client and initialize it with your API key from environment variables.
  2. Prepare the chat messages array with user input.
  3. Call chat.completions.create with stream=True to enable streaming.
  4. Iterate over the streaming response chunks to receive partial tokens.
  5. Concatenate or display tokens in real-time as they arrive.
  6. Handle end of stream and errors gracefully.

Full code

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

print("Streaming response:")
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)

full_response = ""
for chunk in response_stream:
    delta = chunk.choices[0].delta.content  # None for role-only or final chunks
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
print()

# Optionally use full_response for further processing
output
Streaming response:
Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
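Step 6 of the integration (handle end of stream and errors gracefully) is not shown in the code above. A minimal sketch of the same accumulation loop with error handling, using a stand-in generator instead of a live API call (fake_stream and its chunk objects are illustrative, not part of the SDK):

```python
from types import SimpleNamespace

def fake_stream():
    # Stand-in for the SDK's streaming iterator: yields chunk-shaped objects.
    for text in ["Quantum ", "computing ", None, "explained."]:
        yield SimpleNamespace(
            choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
        )

def collect(stream):
    """Accumulate streamed tokens, tolerating empty deltas and mid-stream errors."""
    parts = []
    try:
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:  # skip chunks with no text
                parts.append(delta)
    except Exception as exc:
        # A dropped connection mid-stream surfaces here; keep what arrived so far.
        print(f"\n[stream interrupted: {exc}]")
    return "".join(parts)

print(collect(fake_stream()))  # → Quantum computing explained.
```

The try/except wraps only the iteration, so whatever text arrived before a failure is still returned to the caller.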

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}], "stream": true}
Response
json
{"choices": [{"delta": {"content": "Quantum computing uses..."}, "index": 0, "finish_reason": null}], "id": "chatcmpl-xxx", "object": "chat.completion.chunk"}
Extract: chunk.choices[0].delta.content
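The extract path above can also be applied to a raw chunk payload with the standard json module. A quick sketch using the sample response from the trace (the string is copied from the trace, not a live reply):

```python
import json

chunk_json = (
    '{"choices": [{"delta": {"content": "Quantum computing uses..."}, '
    '"index": 0, "finish_reason": null}], '
    '"id": "chatcmpl-xxx", "object": "chat.completion.chunk"}'
)

chunk = json.loads(chunk_json)
# Mirrors chunk.choices[0].delta.content on the SDK's typed objects.
token = chunk["choices"][0]["delta"].get("content")
print(token)  # → Quantum computing uses...
```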

Variants

Async Streaming with OpenAI SDK

Use async streaming when integrating with async frameworks or to handle multiple concurrent streaming calls efficiently.

python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # The async client is required for "async for"; the sync OpenAI client
    # returns a regular iterator and cannot be awaited.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Tell me a joke about programmers."}]
    print("Async streaming response:")
    response_stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in response_stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta
    print()

asyncio.run(main())
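The concurrency benefit can be seen without a network call. A sketch that drains two simulated token streams concurrently with asyncio.gather (the async generators stand in for AsyncOpenAI response streams):

```python
import asyncio

async def fake_stream(tokens):
    # Stand-in for an async streaming response: yields one token at a time.
    for tok in tokens:
        await asyncio.sleep(0)  # yield control, as real network reads would
        yield tok

async def drain(stream):
    """Consume one stream into a full string."""
    parts = []
    async for tok in stream:
        parts.append(tok)
    return "".join(parts)

async def main():
    # Both streams progress interleaved rather than one after the other.
    a, b = await asyncio.gather(
        drain(fake_stream(["Hello", ", ", "world"])),
        drain(fake_stream(["Hi", " there"])),
    )
    print(a)  # → Hello, world
    print(b)  # → Hi there

asyncio.run(main())
```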
Streaming with Anthropic Claude

Use Anthropic Claude streaming if you prefer Claude models or need specific Claude capabilities.

python
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]

print("Streaming response from Claude:")
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    system="You are a helpful assistant.",
    messages=messages,
    max_tokens=1024,
    stream=True
)

for event in response:
    # The Anthropic stream yields typed events; only content_block_delta
    # events carry text, in event.delta.text.
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)
print()
Non-Streaming Chat Completion

Use non-streaming when you want the full response at once and do not need real-time token output.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print("Full response:")
print(response.choices[0].message.content)

Performance

Latency: ~800ms for gpt-4o non-streaming; streaming latency depends on token generation speed
Cost: ~$0.002 per 500 tokens for gpt-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
  • Limit message history length to reduce tokens.
  • Use concise prompts to save tokens.
  • Stream to start displaying output immediately, improving perceived latency.
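The first tip can be implemented as a simple history cap before each API call. A minimal sketch (the max_messages cutoff of 8 is an illustrative choice, not an API parameter):

```python
def trim_history(messages, max_messages=8):
    """Keep any leading system message plus the most recent turns."""
    system = [m for m in messages if m["role"] == "system"][:1]
    recent = [m for m in messages if m["role"] != "system"][-max_messages:]
    return system + recent

history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(10):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(trimmed))  # → 9 (1 system message + 8 most recent turns)
```

Pass the trimmed list as the messages argument; older turns are dropped, which bounds the prompt token count per call.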
Approach        | Latency                                              | Cost/call              | Best for
Non-Streaming   | ~800ms                                               | ~$0.002 per 500 tokens | Simple use cases, batch processing
Streaming       | Starts within 200-400ms, tokens arrive progressively | ~$0.002 per 500 tokens | Real-time UI, chatbots, better UX
Async Streaming | Similar to streaming but non-blocking                | ~$0.002 per 500 tokens | Concurrent calls, async frameworks

Quick tip

Set stream=True in chat.completions.create and iterate over the returned stream to get tokens as they arrive.

Common mistake

Beginners often access chunk.choices[0].delta.content without checking that it is not None. Role-only and final chunks carry no text, so string operations on their content raise errors.
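A minimal reproduction of the mistake and its fix, using a chunk-shaped object in place of a real SDK chunk (SimpleNamespace stands in for the SDK types):

```python
from types import SimpleNamespace

# A role-only chunk, as the first streamed chunk typically is: content is None.
chunk = SimpleNamespace(
    choices=[SimpleNamespace(delta=SimpleNamespace(content=None, role="assistant"))]
)

# Wrong: assumes every chunk carries text.
# text = chunk.choices[0].delta.content.upper()  # AttributeError on None

# Right: check the delta before using it.
delta = chunk.choices[0].delta.content
if delta:
    print(delta, end="", flush=True)
else:
    print("(no text in this chunk)")
```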

Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022