NotSupportedError
openai.NotSupportedError: Streaming is not supported with this model (o1, o3)
Stack trace
openai.NotSupportedError: Streaming is not supported with this model (o1, o3).
  File "<your-script>.py", line 42, in get_reasoning_response
    response = client.chat.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/site-packages/openai/resources/chat/completions.py", line 187, in create
    return self._post(
           ^^^^^^^^^^^
  File "/site-packages/openai/base_client.py", line 1289, in post
    return self._request(
           ^^^^^^^^^^^^
  File "/site-packages/openai/base_client.py", line 988, in _request
    raise err.error
openai.NotSupportedError: Streaming is not supported with this model (o1, o3). Only non-streaming requests are supported.
Why it happens
OpenAI's reasoning models (o1, o3) use extended thinking internally and must complete the entire reasoning process before returning a response. Streaming would require sending partial reasoning state while the model is still exploring the solution space, so only complete, non-streaming responses are offered for these models. The OpenAI SDK v1.3.0+ surfaces this limitation by raising NotSupportedError when stream=True is passed with an o1/o3 model.
Detection
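Check for stream=True in any client.chat.completions.create() call targeting o1, o3-mini, or o3 models. Use grep or IDE search to audit existing code. The model argument usually sits on a different line from stream=True, so search with context lines and then filter for the model names: grep -rn -B5 -A5 'stream=True' your_codebase/ | grep -E '(o1|o3)'. Treat the matches as candidates to inspect by hand, since the filter also matches any other nearby text containing "o1" or "o3".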
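A text search can miss calls or produce noisy matches, so a syntax-aware check is sometimes more convenient. The following is a minimal sketch that uses Python's standard ast module to flag calls passing both stream=True and a literal o1/o3 model name. The names find_streaming_reasoning_calls, REASONING_PREFIXES, and SAMPLE are hypothetical, and the check only sees literal keyword arguments: a model name held in a variable or config file will not be detected.

```python
import ast

# Assumption: prefixes taken from the model names mentioned on this page.
REASONING_PREFIXES = ("o1", "o3")


def find_streaming_reasoning_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, model) for calls passing stream=True with a literal o1/o3 model."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Collect keyword arguments whose values are plain literals.
        literals = {
            kw.arg: kw.value.value
            for kw in node.keywords
            if kw.arg is not None and isinstance(kw.value, ast.Constant)
        }
        model = literals.get("model")
        if (
            literals.get("stream") is True
            and isinstance(model, str)
            and model.startswith(REASONING_PREFIXES)
        ):
            hits.append((node.lineno, model))
    return sorted(hits)


# Hypothetical input: one offending call and one call that is fine.
SAMPLE = '''
client.chat.completions.create(
    model="o1",
    messages=[],
    stream=True,
)
client.chat.completions.create(model="gpt-4o", messages=[], stream=True)
'''

print(find_streaming_reasoning_calls(SAMPLE))
```

To audit a real file, read its contents and pass the text to the function; each hit gives the line where the offending call starts.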
Causes & fixes
Cause: Passing stream=True when calling client.chat.completions.create() with model='o1' or 'o3-mini'.
Fix: Remove stream=True entirely or set stream=False (the default). Non-streaming is required for reasoning models.
Cause: Migrating code from gpt-4o/gpt-4o-mini (which support streaming) to o1/o3 without updating the stream parameter.
Fix: Make the flag depend on the model: pass stream=False when model.startswith(('o1', 'o3')), and pass stream=True only for other models, and only if needed. Alternatively, look the model up in a model compatibility matrix.
Cause: Using streaming wrappers or libraries (like LangChain's StreamingCallbackHandler) with o1/o3 models.
Fix: Disable streaming for o1/o3 in your abstraction layer. Create a model-aware fallback that buffers the full response instead of streaming.
Cause: Attempting to use the async astream() method with o1/o3 models.
Fix: Switch to the standard create() method (not astream() or stream()) for reasoning models. Use asyncio.to_thread() if you need async compatibility without streaming.
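The asyncio.to_thread() suggestion in the last fix could look roughly like the sketch below. The helper name create_without_streaming is hypothetical; it accepts any blocking callable (for example client.chat.completions.create), drops the stream argument, and runs the call in a worker thread so the event loop stays responsive.

```python
import asyncio
from typing import Any, Callable


async def create_without_streaming(create: Callable[..., Any], **kwargs: Any) -> Any:
    """Run a blocking create() call in a worker thread with streaming disabled."""
    # Drop any stream argument; non-streaming is the default when it is omitted.
    kwargs.pop("stream", None)
    return await asyncio.to_thread(create, **kwargs)
```

Inside a coroutine it would be awaited as create_without_streaming(client.chat.completions.create, model='o1', messages=[...]). The call still takes as long as the model needs to finish reasoning; the thread only keeps other coroutines from being blocked in the meantime.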
Code: broken vs fixed
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# BROKEN: stream=True is not supported with o1
response = client.chat.completions.create(
    model='o1',
    messages=[{'role': 'user', 'content': 'Solve this math problem: 2^100'}],
    stream=True,  # ❌ This causes NotSupportedError
    max_completion_tokens=4000,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

# FIXED: Removed stream=True for o1 (reasoning models do not support streaming)
response = client.chat.completions.create(
    model='o1',
    messages=[{'role': 'user', 'content': 'Solve this math problem: 2^100'}],
    # stream=True is removed: o1 only supports non-streaming responses
    max_completion_tokens=4000,
)

# The call blocks until reasoning finishes, then returns the full response at once
print('Answer:')
print(response.choices[0].message.content)
print(f'\nInput tokens: {response.usage.prompt_tokens}')
print(f'Output tokens: {response.usage.completion_tokens}')
Workaround
If you have a UI that expects streaming behavior, buffer the full o1 response on the server and send it to the UI in chunks. Store the complete response from client.chat.completions.create(model='o1', stream=False, ...), then split response.choices[0].message.content by sentence or word and send the pieces at timed intervals (e.g., asyncio.sleep(0.1) between chunks). This gives the appearance of streaming while respecting o1's non-streaming requirement. It does not shorten the wait before the first chunk, because the full response must arrive before any of it can be sent.
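The chunking step of this workaround might be sketched as an async generator like the one below. The name simulate_stream and the choice to split on single spaces are assumptions for illustration; the text argument stands in for response.choices[0].message.content after the non-streaming call has returned.

```python
import asyncio
from typing import AsyncIterator


async def simulate_stream(text: str, delay: float = 0.1) -> AsyncIterator[str]:
    """Yield an already-complete response word by word, pausing between chunks."""
    words = text.split(" ")
    for i, word in enumerate(words):
        # Re-attach the space removed by split() so the chunks join back to the original text.
        yield word if i == len(words) - 1 else word + " "
        await asyncio.sleep(delay)
```

A server handler would iterate with async for chunk in simulate_stream(full_text) and forward each chunk over whatever transport the UI already uses for streamed output.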
Prevention
Build a model-aware abstraction layer that checks the model before setting the stream parameter. Create an enum or config table mapping models to their streaming capability: SUPPORTS_STREAMING = {'gpt-4o', 'gpt-4o-mini', 'gpt-3.5-turbo', ...}; REASONING_ONLY = {'o1', 'o3', 'o3-mini'}. Use this in your request builder: if model in REASONING_ONLY: stream=False. Document this limitation in your API layer or RAG system so developers know o1/o3 are not suitable for real-time streaming responses.
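A request builder along these lines could be sketched as follows. The two sets reuse the example entries above and are illustrative only, so check them against current model documentation before relying on them. The function name build_request and its want_stream parameter are hypothetical.

```python
from typing import Any

# Assumption: illustrative capability table copied from the examples on this page.
SUPPORTS_STREAMING = frozenset({"gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"})
REASONING_ONLY = frozenset({"o1", "o3", "o3-mini"})


def build_request(
    model: str,
    messages: list[dict[str, str]],
    want_stream: bool = False,
) -> dict[str, Any]:
    """Return kwargs for chat.completions.create(), forcing stream off where unsupported."""
    if model not in SUPPORTS_STREAMING | REASONING_ONLY:
        # Fail loudly so new models are classified deliberately rather than guessed.
        raise ValueError(f"Unknown model {model!r}: add it to the capability table first")
    stream = want_stream and model in SUPPORTS_STREAMING
    return {"model": model, "messages": messages, "stream": stream}
```

Callers would then write client.chat.completions.create(**build_request(model, messages, want_stream=True)) and check the returned "stream" value to decide whether to iterate over chunks or read the full message. Rejecting unknown models is a design choice in this sketch; defaulting them to non-streaming would be a more permissive alternative.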