Token usage in streaming responses
Why this matters
If you're billing users, monitoring costs, or enforcing token limits, streaming hides the token usage data that non-streamed responses provide. Understanding where token counts live (and don't live) prevents silent cost overruns and broken quota systems.
Explanation
When you call client.chat.completions.create() with stream=True, the API sends response chunks as they are generated, so you can display text to users in real time. Each chunk is a ChatCompletionChunk object that carries a fragment of text, not token usage data. By default, the usage field (which holds prompt_tokens, completion_tokens, and total_tokens) is None on every chunk. If you pass stream_options={'include_usage': True}, the API appends one extra final chunk whose usage is populated and whose choices list is empty.
Non-streamed responses (stream=False) return a single ChatCompletion object with complete usage metadata as soon as the call returns. Streamed responses trade that visibility for lower latency: the API cannot report final token counts until generation finishes, which is after you have already started consuming the stream.
To track costs and token usage with streaming, you have three options: (1) calculate tokens client-side using a tokenizer library like tiktoken, (2) make a separate non-streamed request for token estimation, or (3) accept that token counts arrive late, request them with stream_options={'include_usage': True}, and log them after the stream completes. Most production systems estimate tokens during streaming and reconcile with actual counts in logs.
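Option (1) can be sketched as a small running counter. This is a minimal sketch under stated assumptions, not a definitive implementation: RunningTokenEstimate and rough_encode are hypothetical names, and the whitespace-based stand-in tokenizer exists only so the example runs without extra dependencies. In practice you would likely pass a real tokenizer's encode function, for example tiktoken.get_encoding("o200k_base").encode, assuming that encoding matches your model.

```python
from typing import Callable, List, Sequence


def rough_encode(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace.

    Real token counts will differ; swap in a real tokenizer's encode function.
    """
    return text.split()


class RunningTokenEstimate:
    """Accumulates streamed text and estimates completion tokens so far."""

    def __init__(self, encode: Callable[[str], Sequence] = rough_encode) -> None:
        self._encode = encode
        self._parts: List[str] = []

    def add(self, fragment: str) -> int:
        """Record one streamed fragment and return the updated estimate."""
        self._parts.append(fragment)
        return self.estimate()

    def estimate(self) -> int:
        # Re-encode the full text: fragments can split a token in half,
        # so summing per-fragment counts would overestimate.
        return len(self._encode("".join(self._parts)))


tracker = RunningTokenEstimate()
for fragment in ["Quantum comp", "uting uses ", "qubits."]:
    tracker.add(fragment)
print(tracker.estimate())  # 4 with the whitespace stand-in
```

Re-encoding the whole text on every fragment is quadratic in output length, so for long responses you may prefer to call estimate() only every few fragments. The estimate covers completion text only; prompt tokens include chat formatting overhead that a plain text count does not capture.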
Request code
```python
import os

from openai import OpenAI

# The SDK also reads OPENAI_API_KEY on its own; passing it here is explicit.
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

print('=== Streaming (token usage is None) ===')
with client.chat.completions.create(
    model='gpt-4.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in 2 sentences.'}],
    stream=True
) as stream:
    full_text = ''
    chunks_without_usage = 0
    for chunk in stream:
        # Some chunks carry no choices, so guard before indexing.
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_text += content
            print(content, end='', flush=True)
        if chunk.usage:
            print(f'\nUsage in chunk: {chunk.usage}')
        else:
            chunks_without_usage += 1
print(f'\nChunks with usage=None: {chunks_without_usage}')
print(f'Full response: {full_text}')

print('\n=== Non-streaming (token usage is available) ===')
response = client.chat.completions.create(
    model='gpt-4.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in 2 sentences.'}],
    stream=False
)
print(f'Response: {response.choices[0].message.content}')
print(f'Token usage: {response.usage}')
print(f'Prompt tokens: {response.usage.prompt_tokens}')
print(f'Completion tokens: {response.usage.completion_tokens}')
print(f'Total tokens: {response.usage.total_tokens}')
```
Authentication
Set your API key before instantiating the client:

```bash
export OPENAI_API_KEY='sk-...'
```

Or pass it directly:

```python
from openai import OpenAI

client = OpenAI(api_key='sk-...')
```

The SDK reads OPENAI_API_KEY from the environment at instantiation time.
Response shape
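| Object | What it contains |
|---|---|
| ChatCompletionChunk (streamed) | One fragment of the response: choices[0].delta.content holds the new text, finish_reason is set on the last chunk that carries a choice, and usage is None unless you request it with stream_options. |
| ChatCompletion (non-streamed) | The complete response: choices[0].message.content holds the full text, alongside finish_reason and usage with prompt_tokens, completion_tokens, and total_tokens. |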
Field guide
usage: Present on non-streamed responses with exact token counts. On streamed responses it is None on every chunk unless you set stream_options={'include_usage': True}, in which case it arrives on one extra chunk at the very end. Plan accordingly.
choices[0].delta.content: On streaming chunks, the text fragment carried by this chunk. On non-streamed responses, use choices[0].message.content instead.
finish_reason: Why the response ended. 'stop' means normal completion, 'length' means the max token limit was hit, and 'tool_calls' means the model requested a tool call. In streaming, it arrives on the last chunk that carries a choice.
Setup trap
Forgetting that stream=True returns a Stream object you iterate over (it also works as a context manager), not a complete response. Accessing response.usage directly on that object raises an AttributeError. Usage data only becomes available once you have consumed the stream by iterating through its chunks, and even then it stays None unless you requested it with stream_options={'include_usage': True}.
Cost
Streaming incurs the same per-token cost as non-streamed requests; the pricing structure is identical. The risk is in your own accounting: if you lose visibility of token usage and build incomplete logging, you may underbill users or skip quota enforcement, and a single misconfigured streaming endpoint can run far past your estimate before anyone notices.
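Because pricing is per token, the cost of a request follows directly from its usage counts. The sketch below shows that arithmetic; request_cost is a hypothetical helper and the two prices are placeholders chosen for illustration, not real rates, so substitute the published prices for your model.

```python
from decimal import Decimal


def request_cost(prompt_tokens: int, completion_tokens: int,
                 input_price_per_million: Decimal,
                 output_price_per_million: Decimal) -> Decimal:
    """Cost of one request in dollars, given per-million-token prices."""
    million = Decimal(1_000_000)
    return (Decimal(prompt_tokens) * input_price_per_million
            + Decimal(completion_tokens) * output_price_per_million) / million


# Placeholder prices (assumption): look up the real ones for your model.
INPUT_PRICE = Decimal("2.00")
OUTPUT_PRICE = Decimal("8.00")

# 1,200 prompt tokens and 350 completion tokens:
# (1200 * 2.00 + 350 * 8.00) / 1,000,000 dollars
cost = request_cost(1_200, 350, INPUT_PRICE, OUTPUT_PRICE)
print(cost)
```

Decimal is used instead of float so that many small per-request amounts add up without rounding drift. The same function applies to streamed and non-streamed requests; only the moment at which the token counts become available differs.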
Rate limits
Streaming does not reduce your rate limit consumption. Each request, streamed or not, counts as one request toward your requests-per-minute limit, and its tokens count toward your tokens-per-minute limit. If streaming's lower latency leads you to run more concurrent requests, you may hit limits faster.
Common gotcha
Developers assume token usage arrives with the first chunk or at regular intervals. It doesn't. Most chunks have usage=None. If you're building a cost dashboard and only log the first chunk, you'll record no usage at all. Always check that usage is not None before reading it, or wait until the stream completes.
Error recovery
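AttributeError: 'NoneType' object has no attribute 'prompt_tokens'. Cause: you read chunk.usage.prompt_tokens on a chunk whose usage is None. Check that usage is not None first.
TypeError: 'ChatCompletionChunk' object is not subscriptable. Cause: you indexed the chunk like a dict, as in chunk['choices']. Use attribute access such as chunk.choices instead.
usage is None after stream completes. Cause: the request did not ask for usage. Pass stream_options={'include_usage': True} or fall back to a client-side estimate.
Experienced dev note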
In production, use tiktoken to estimate tokens client-side during streaming, then log actual usage from a separate non-streamed request at a 5% sample rate (enough to catch drift without much added cost). This gives you real-time user feedback while preserving cost visibility. Never rely solely on chunk.usage, since it is None unless you explicitly request it.
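The sampling and drift check described above can be sketched as follows. Both in_sample and drift are hypothetical helpers, and hashing the request id is one possible way to pick the sample (an assumption, chosen so the same request always gets the same decision); a random draw would work too.

```python
import hashlib


def in_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def drift(estimated: int, actual: int) -> float:
    """Relative error of the client-side estimate against the actual count."""
    if actual == 0:
        return float("inf") if estimated else 0.0
    return (estimated - actual) / actual


sampled = sum(in_sample(f"req-{i}") for i in range(10_000))
print(sampled)         # roughly 500 at a 5% rate
print(drift(96, 100))  # -0.04: the estimate ran 4% low
```

Logging drift per sampled request lets you alert when the estimate moves away from actual counts, for example after a model or tokenizer change.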
Check your understanding
You're building a streaming chat endpoint where users pay per token. You stream the response for fast UX but also need accurate billing. Why can't you simply read chunk.usage from the stream and charge immediately? What would happen if you tried?
Answer hint
chunk.usage is None on almost all chunks. Even if it weren't, the final token count isn't known until the stream ends. You'd either charge nothing or charge incomplete amounts, breaking your billing model.