How to stream chatbot responses in Python
Direct answer
Use the OpenAI SDK's chat.completions.create method with stream=True and iterate over the response (synchronously or asynchronously) to receive tokens as they arrive.
Setup
Install
pip install openai
Env vars
OPENAI_API_KEY
Imports
import os
from openai import OpenAI
Examples
In: Hello, how are you?
Out: Hi! I'm doing great, thanks for asking. How can I assist you today?
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
In: Tell me a joke about programmers.
Out: Why do programmers prefer dark mode? Because light attracts bugs!
Integration steps
- Import the OpenAI client and initialize it with your API key from environment variables.
- Prepare the chat messages array with user input.
- Call chat.completions.create with stream=True to enable streaming.
- Iterate over the streaming response chunks to receive partial tokens.
- Concatenate or display tokens in real-time as they arrive.
- Handle end of stream and errors gracefully.
Full code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
print("Streaming response:")
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)
full_response = ""
for chunk in response_stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
print()
# Optionally use full_response for further processing
Output
Streaming response: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
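The last integration step, handling errors gracefully, can be sketched as a small helper that consumes any chunk iterator and keeps whatever text arrived before a failure. This is a minimal sketch; `consume_stream`, `fake_chunk`, and `failing_stream` are hypothetical names, and the mock chunks are simplified stand-ins rather than real SDK objects.

```python
from types import SimpleNamespace

def consume_stream(chunks):
    """Accumulate streamed text; return partial output even if the stream fails mid-way."""
    parts = []
    try:
        for chunk in chunks:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
    except Exception as exc:  # e.g. a dropped connection or API error mid-stream
        print(f"\n[stream interrupted: {exc}]")
    return "".join(parts)

def fake_chunk(text):
    # Simplified stand-in mirroring the chunk.choices[0].delta.content shape
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

def failing_stream():
    yield fake_chunk("Quantum ")
    yield fake_chunk("computing")
    raise ConnectionError("connection dropped")

partial = consume_stream(failing_stream())
print(partial)  # Quantum computing
```

Returning the partial text lets the caller decide whether to retry, resume, or show the user what was generated so far.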
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}], "stream": true}
Response
{"choices": [{"delta": {"content": "Quantum computing uses..."}, "index": 0, "finish_reason": null}], "id": "chatcmpl-xxx", "object": "chat.completion.chunk"}
Extract
chunk.choices[0].delta.content
Variants
Async Streaming with OpenAI SDK ›
Use async streaming when integrating with async frameworks or to handle multiple concurrent streaming calls efficiently.
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # Use AsyncOpenAI (not OpenAI) so the create call can be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Tell me a joke about programmers."}]
    print("Async streaming response:")
    response_stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in response_stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta
    print()

asyncio.run(main())
Streaming with Anthropic Claude ›
Use Anthropic Claude streaming if you prefer Claude models or need specific Claude capabilities.
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
print("Streaming response from Claude:")
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    system="You are a helpful assistant.",
    messages=messages,
    max_tokens=1024,
    stream=True
)
# Claude streams typed events rather than plain content chunks;
# text arrives on content_block_delta events as event.delta.text
for event in response:
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)
print()
Non-Streaming Chat Completion ›
Use non-streaming when you want the full response at once and do not need real-time token output.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print("Full response:")
print(response.choices[0].message.content)
Performance
Latency: ~800ms for gpt-4o non-streaming; streaming latency depends on token generation speed
Cost: ~$0.002 per 500 tokens for gpt-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
- Limit message history length to reduce tokens.
- Use concise prompts to save tokens.
- Stream to start displaying output immediately, improving perceived latency.
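The first tip, limiting message history length, can be sketched as a small pure-Python helper. This is one possible approach, not an SDK feature; `trim_history` is a hypothetical name, and the policy shown (keep the system prompt plus the most recent turns) is an assumption.

```python
def trim_history(messages, max_messages=10):
    """Keep the system prompt (if present) plus the most recent turns."""
    if messages and messages[0].get("role") == "system":
        return [messages[0]] + messages[1:][-(max_messages - 1):]
    return messages[-max_messages:]

# Example: a system prompt followed by 25 conversation messages
history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(25)]
trimmed = trim_history(history, max_messages=10)
print(len(trimmed))        # 10
print(trimmed[0]["role"])  # system
```

A token-based cutoff (counting with a tokenizer instead of message count) is more precise but follows the same shape.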
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Non-Streaming | ~800ms | ~$0.002 per 500 tokens | Simple use cases, batch processing |
| Streaming | Starts within 200-400ms, tokens arrive progressively | ~$0.002 per 500 tokens | Real-time UI, chatbots, better UX |
| Async Streaming | Similar to streaming but non-blocking | ~$0.002 per 500 tokens | Concurrent calls, async frameworks |
Quick tip
Always set stream=True in chat.completions.create and iterate over the response to get tokens as they arrive.
Common mistake
Beginners often forget to check that delta.content is not None in each chunk and try to use it directly; concatenating a None delta raises a TypeError.
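To see why the check matters, here is a minimal illustration using simplified stand-in chunks (not real SDK objects); in a real stream the first chunk typically carries only a role, with content set to None.

```python
from types import SimpleNamespace

# Simplified stand-ins for SDK chunk objects; a real stream behaves the same way:
# some chunks (e.g. the first one) have delta.content == None.
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=" world"))]),
]

# Wrong: text += chunk.choices[0].delta.content on every chunk
# raises TypeError on the None chunk.
# Right: guard each delta before appending.
text = ""
for chunk in chunks:
    delta = chunk.choices[0].delta.content
    if delta:  # skip chunks whose delta carries no text
        text += delta
print(text)  # Hello world
```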