How to handle streaming chunks from OpenAI
Pass the stream=True parameter to client.chat.completions.create(), then iterate over the returned sync or async stream and process each chunk's delta.content incrementally for real-time output. Read chunk.choices[0].delta.content defensively, since some chunks carry no content.

Why this happens
Developers often treat a streaming response from OpenAI as a single, complete object instead of an iterable stream. This leads to errors or missing partial output. For example, calling response.choices[0].message.content directly on a streaming response fails because the SDK returns a stream that yields chunks, not a complete message.
Typical broken code:
```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# Incorrect: trying to access the full content directly
print(response.choices[0].message.content)
# AttributeError: 'Stream' object has no attribute 'choices'
```
The fix
Use a for-loop to iterate over the streaming response. Each chunk contains partial content in chunk.choices[0].delta.content. Append or print these chunks incrementally to reconstruct the full message as it streams.
This works because the SDK returns an iterable stream of partial completions, allowing real-time processing.
```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

full_response = ""
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # skip chunks that carry no new text (e.g. the final chunk)
        print(delta, end="", flush=True)
        full_response += delta

print("\nFull response:", full_response)
# Hello
# Full response: Hello
```
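The same pattern applies with the async client, using async for instead of a plain loop. A minimal sketch, assuming an openai.AsyncOpenAI instance is passed in as `client`; the helper name stream_chat is our own:

```python
import asyncio

async def stream_chat(client, model, prompt):
    """Stream a chat completion and return the assembled text.

    `client` is assumed to be an openai.AsyncOpenAI instance."""
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    async for chunk in stream:
        # Guard against chunks with no choices or no new content
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Usage with the real client:
#   client = AsyncOpenAI()
#   text = asyncio.run(stream_chat(client, "gpt-4o-mini", "Hello"))
```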
Preventing it in production
Implement robust retry logic with exponential backoff to handle transient network or rate limit errors during streaming. Validate that each chunk contains delta.content before processing to avoid NoneType errors. Consider buffering partial chunks if you need to process or store the full response after streaming.
Also, handle stream termination gracefully by checking chunk.choices[0].finish_reason, which is set on the final chunk.
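A minimal retry sketch along those lines, assuming the caller supplies a zero-argument callable that opens a fresh stream (e.g. a lambda wrapping client.chat.completions.create(..., stream=True)). The helper name stream_with_retry is our own:

```python
import time

def stream_with_retry(start_stream, max_retries=3, base_delay=1.0):
    """Consume a streaming completion with exponential backoff.

    A retry restarts the stream from scratch, so partial text from a
    failed attempt is discarded rather than duplicated."""
    for attempt in range(max_retries + 1):
        try:
            parts = []
            for chunk in start_stream():
                if not chunk.choices:  # defensively skip empty chunks
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    parts.append(delta)
            return "".join(parts)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Usage:
#   full = stream_with_retry(
#       lambda: client.chat.completions.create(
#           model="gpt-4o-mini",
#           messages=[{"role": "user", "content": "Hello"}],
#           stream=True,
#       )
#   )
```

Restarting the whole stream is the simplest recovery strategy; resuming mid-stream is not supported by the API, so buffering the assembled text only after a successful pass avoids duplicated output.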
Key Takeaways
- Always iterate over the streaming response to handle partial chunks correctly.
- Extract partial content from chunk.choices[0].delta.content safely to build the full output.
- Implement retries and validate chunk content to ensure robust streaming in production.