How to stream chatbot responses in Python
Direct answer
Use the OpenAI SDK's chat.completions.create method with stream=True and iterate over the response (synchronously or asynchronously) to receive tokens as they arrive.
Setup
Install
pip install openai
Env vars
OPENAI_API_KEY
Imports
import os
from openai import OpenAI
Examples
In: Hello, how are you?
Out: Hi! I'm doing great, thanks for asking. How can I assist you today?
In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
In: Tell me a joke about programmers.
Out: Why do programmers prefer dark mode? Because light attracts bugs!
Integration steps
- Import the OpenAI client and initialize it with your API key from environment variables.
- Prepare the chat messages array with user input.
- Call chat.completions.create with stream=True to enable streaming.
- Iterate over the streaming response chunks to receive partial tokens.
- Concatenate or display tokens in real-time as they arrive.
- Handle end of stream and errors gracefully.
Full code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]
print("Streaming response:")
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)
full_response = ""
for chunk in response_stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        full_response += delta
print()
# Optionally use full_response for further processing
Output
Streaming response: Quantum computing uses quantum bits that can be in multiple states at once, enabling faster problem solving for certain tasks.
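The last integration step, handling errors gracefully, can be sketched as a small helper that consumes any chunk iterator and keeps whatever text arrived before a failure. This is a minimal sketch; `consume_stream`, `fake_chunk`, and `failing_stream` are hypothetical names, and the mock chunks are simplified stand-ins rather than real SDK objects.

```python
from types import SimpleNamespace

def consume_stream(chunks):
    """Accumulate streamed text; return partial output even if the stream fails mid-way."""
    parts = []
    try:
        for chunk in chunks:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
    except Exception as exc:  # e.g. a dropped connection or API error mid-stream
        print(f"\n[stream interrupted: {exc}]")
    return "".join(parts)

def fake_chunk(text):
    # Simplified stand-in mirroring the chunk.choices[0].delta.content shape
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

def failing_stream():
    yield fake_chunk("Quantum ")
    yield fake_chunk("computing")
    raise ConnectionError("connection dropped")

partial = consume_stream(failing_stream())
print(partial)  # Quantum computing
```

Returning the partial text lets the caller decide whether to retry, resume, or show the user what was generated so far.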
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}], "stream": true}
Response
{"choices": [{"delta": {"content": "Quantum computing uses..."}, "index": 0, "finish_reason": null}], "id": "chatcmpl-xxx", "object": "chat.completion.chunk"}
Extract
chunk.choices[0].delta.content
Variants
Async Streaming with OpenAI SDK ›
Use async streaming when integrating with async frameworks or to handle multiple concurrent streaming calls efficiently.
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # Use AsyncOpenAI (not OpenAI) so the create call can be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [{"role": "user", "content": "Tell me a joke about programmers."}]
    print("Async streaming response:")
    response_stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True
    )
    full_response = ""
    async for chunk in response_stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta
    print()

asyncio.run(main())
Streaming with Anthropic Claude ›
Use Anthropic Claude streaming if you prefer Claude models or need specific Claude capabilities.
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
print("Streaming response from Claude:")
response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    system="You are a helpful assistant.",
    messages=messages,
    max_tokens=1024,
    stream=True
)
# Claude streams typed events rather than plain content chunks;
# text arrives on content_block_delta events as event.delta.text
for event in response:
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)
print()
Non-Streaming Chat Completion ›
Use non-streaming when you want the full response at once and do not need real-time token output.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)
print("Full response:")
print(response.choices[0].message.content)
Performance
Latency: ~800ms for gpt-4o non-streaming; streaming latency depends on token generation speed
Cost: ~$0.002 per 500 tokens for gpt-4o
Rate limits: Tier 1: 500 requests per minute / 30,000 tokens per minute
- Limit message history length to reduce tokens.
- Use concise prompts to save tokens.
- Stream to start displaying output immediately, improving perceived latency.
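The first tip, limiting message history length, can be sketched as a small pure-Python helper. This is one possible approach, not an SDK feature; `trim_history` is a hypothetical name, and the policy shown (keep the system prompt plus the most recent turns) is an assumption.

```python
def trim_history(messages, max_messages=10):
    """Keep the system prompt (if present) plus the most recent turns."""
    if messages and messages[0].get("role") == "system":
        return [messages[0]] + messages[1:][-(max_messages - 1):]
    return messages[-max_messages:]

# Example: a system prompt followed by 25 conversation messages
history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(25)]
trimmed = trim_history(history, max_messages=10)
print(len(trimmed))        # 10
print(trimmed[0]["role"])  # system
```

A token-based cutoff (counting with a tokenizer instead of message count) is more precise but follows the same shape.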
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Non-Streaming | ~800ms | ~$0.002 per 500 tokens | Simple use cases, batch processing |
| Streaming | Starts within 200-400ms, tokens arrive progressively | ~$0.002 per 500 tokens | Real-time UI, chatbots, better UX |
| Async Streaming | Similar to streaming but non-blocking | ~$0.002 per 500 tokens | Concurrent calls, async frameworks |
Quick tip
Always set stream=True in chat.completions.create and iterate over the response to get tokens as they arrive.
Common mistake
Beginners often forget to check that delta.content is not None in each chunk and try to use it directly; concatenating a None delta raises a TypeError.
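To see why the check matters, here is a minimal illustration using simplified stand-in chunks (not real SDK objects); in a real stream the first chunk typically carries only a role, with content set to None.

```python
from types import SimpleNamespace

# Simplified stand-ins for SDK chunk objects; a real stream behaves the same way:
# some chunks (e.g. the first one) have delta.content == None.
chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello"))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=" world"))]),
]

# Wrong: text += chunk.choices[0].delta.content on every chunk
# raises TypeError on the None chunk.
# Right: guard each delta before appending.
text = ""
for chunk in chunks:
    delta = chunk.choices[0].delta.content
    if delta:  # skip chunks whose delta carries no text
        text += delta
print(text)  # Hello world
```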