API Beginner easy · 5 min

Token usage in streaming responses

What you will learn
By default, streaming responses don't include token counts in the stream. You must opt in to a final usage chunk with stream_options={"include_usage": True}, calculate the counts yourself, or use a non-streamed response.

Why this matters

If you're billing users, monitoring costs, or enforcing token limits, streaming hides the token usage data that non-streamed responses provide. Understanding where token counts live (and don't live) prevents silent cost overruns and broken quota systems.

Skip if: Use non-streamed responses (stream=False) when you need token counts immediately and latency isn't critical. Use streaming when real-time feedback matters more than upfront token visibility, accepting that you'll need to estimate or log separately.

Explanation

When you call client.chat.completions.create() with stream=True, the API sends response chunks as they are generated, allowing you to display text to users in real time. However, each chunk is a ChatCompletionChunk object containing only a text fragment, not token usage data. By default, the usage field (which contains prompt_tokens, completion_tokens, and total_tokens) is None on every chunk. If you pass stream_options={"include_usage": True}, the API sends one extra final chunk whose choices list is empty and whose usage field is populated.

Non-streamed responses (stream=False) return a single ChatCompletion object with complete usage metadata immediately. Streamed responses trade this upfront visibility for latency reduction. The API doesn't calculate final token counts until the response is complete, which is after you've already started consuming the stream.

To track costs and token usage with streaming, you have three options: (1) calculate tokens client-side using a tokenizer library like tiktoken, (2) make a separate non-streamed request for token estimation (it is billed again, and its completion may differ in length), or (3) accept that token counts arrive late: request them with stream_options={"include_usage": True} and log them after the stream completes. Most production systems estimate tokens during streaming and reconcile with actual counts in logs.
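The estimate-then-reconcile approach can be sketched as below. This is a minimal illustration, not a definitive implementation: StreamingTokenEstimator and rough_token_count are hypothetical names, and the default counter is a crude stand-in (about 4 characters per token, an assumption) so the sketch runs without extra dependencies. In practice you would inject a real tokenizer, for example a function that returns len(encoding.encode(text)) with tiktoken.

```python
from typing import Callable, Optional


def rough_token_count(text: str) -> int:
    """Stand-in counter (assumption): roughly 4 characters per token.

    Replace with a real tokenizer for anything billing-related.
    """
    return (len(text) + 3) // 4  # ceiling division; empty text gives 0


class StreamingTokenEstimator:
    """Tracks a running completion-token estimate while text deltas arrive."""

    def __init__(self, count_tokens: Callable[[str], int] = rough_token_count):
        self._count_tokens = count_tokens
        self._parts = []

    def add(self, delta: str) -> int:
        """Record one text fragment and return the updated estimate."""
        self._parts.append(delta)
        return self.estimate()

    def estimate(self) -> int:
        # Count the joined text, because a token can span fragment boundaries.
        return self._count_tokens("".join(self._parts))

    def drift(self, actual_completion_tokens: Optional[int]) -> Optional[int]:
        """Estimate minus the API-reported count, or None if none was reported."""
        if actual_completion_tokens is None:
            return None
        return self.estimate() - actual_completion_tokens


estimator = StreamingTokenEstimator()
for fragment in ["Quantum ", "computers ", "use qubits."]:
    running = estimator.add(fragment)

print(running)             # 8 (29 characters with the stand-in counter)
print(estimator.drift(7))  # 1 (estimate is one token above a reported count of 7)
```

Logging the drift value next to each request makes it easy to see whether your estimator is consistently over or under the reported counts.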

Request code

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))

print('=== Streaming (token usage is None) ===')
with client.chat.completions.create(
    model='gpt-4.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in 2 sentences.'}],
    stream=True
) as stream:
    full_text = ''
    chunks_without_usage = 0
    for chunk in stream:
        # Guard against chunks with an empty choices list before indexing.
        if chunk.choices and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_text += content
            print(content, end='', flush=True)
        if chunk.usage:
            print(f'\nUsage in chunk: {chunk.usage}')
        else:
            # Count silent chunks instead of printing a line for each one.
            chunks_without_usage += 1

print(f'\nChunks with usage=None: {chunks_without_usage}')
print(f'Full response: {full_text}')

print('\n=== Non-streaming (token usage is available) ===')
response = client.chat.completions.create(
    model='gpt-4.1',
    messages=[{'role': 'user', 'content': 'Explain quantum computing in 2 sentences.'}],
    stream=False
)
print(f'Response: {response.choices[0].message.content}')
# usage is populated on the single ChatCompletion object.
print(f'Token usage: {response.usage}')
print(f'Prompt tokens: {response.usage.prompt_tokens}')
print(f'Completion tokens: {response.usage.completion_tokens}')
print(f'Total tokens: {response.usage.total_tokens}')
```

Authentication

Set your API key before instantiating the client:

```bash
export OPENAI_API_KEY='sk-...'
```

Or pass it directly:

```python
from openai import OpenAI

client = OpenAI(api_key='sk-...')
```

The SDK reads OPENAI_API_KEY from the environment at instantiation time.

Response shape

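| Field | Description |
| --- | --- |
| ChatCompletionChunk (streamed) | One object per chunk. The text fragment is in choices[0].delta.content; usage is None by default. |
| ChatCompletion (non-streamed) | A single object. The full text is in choices[0].message.content; usage holds prompt_tokens, completion_tokens, and total_tokens. |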

Field guide

usage

Present on non-streamed responses with exact token counts. On streamed responses, it is None on every chunk by default. With stream_options={"include_usage": True}, only the final chunk carries it, so plan accordingly.

choices[0].delta.content

On streaming chunks, this is the text fragment for this chunk. On non-streamed responses, use choices[0].message.content instead.

finish_reason

Tells you why the response ended ('stop' means normal completion, 'length' means the max token limit was hit, 'tool_calls' means the model requested a tool call). In streaming, it is None on every chunk except the last one that carries a choice.

Setup trap

Forgetting that stream=True returns an iterable stream object (also usable as a context manager), not a complete response object. The stream object itself has no usage attribute, so reading response.usage on it raises an AttributeError. You must consume the stream (iterate through the chunks) before any usage data can arrive, and even then it is None unless you requested it with stream_options={"include_usage": True}.
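To make the "consume first, then read usage" pattern concrete, here is a small sketch. collect_stream and fake_chunk are hypothetical helpers, and the fake chunks are stand-ins built with SimpleNamespace so the example runs offline. With the real SDK you would pass the object returned by create(..., stream=True) instead, assuming its chunks expose choices and usage as described above.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Consume an iterable of chunk-like objects and return (text, usage).

    usage is the last non-None chunk.usage seen, or None if no chunk had one.
    """
    parts = []
    usage = None
    for chunk in chunks:
        # A usage-only final chunk has an empty choices list, so guard the index.
        if chunk.choices:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


def fake_chunk(content=None, usage=None, with_choice=True):
    """Hypothetical stand-in for an SDK chunk, used only to run this sketch."""
    choices = []
    if with_choice:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=content))]
    return SimpleNamespace(choices=choices, usage=usage)


final_usage = SimpleNamespace(prompt_tokens=12, completion_tokens=3, total_tokens=15)
stream = [
    fake_chunk("Hello"),
    fake_chunk(", world"),
    fake_chunk(None),                                  # finishing chunk, no text
    fake_chunk(usage=final_usage, with_choice=False),  # usage-only chunk
]

text, usage = collect_stream(stream)
print(text)                # Hello, world
print(usage.total_tokens)  # 15
```

If no chunk carries usage, the function returns None for it, so callers still need a None check before reading token fields.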

Cost

Streaming incurs the same per-token cost as non-streamed requests (same pricing structure). However, if you lose visibility of token usage and build incomplete logging, you may underbill users or miss quota enforcement, leading to unexpected overages. A single misconfigured streaming endpoint can silently run far over your estimate.
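As a quick illustration of the arithmetic, the sketch below computes the cost of one request from its token counts. The function name and the prices are hypothetical placeholders, not real rates; substitute the published prices for your model. Note that nothing in the calculation depends on whether the request was streamed.

```python
def request_cost_usd(prompt_tokens, completion_tokens,
                     input_usd_per_million, output_usd_per_million):
    """Cost of one request given token counts and per-million-token prices."""
    return (prompt_tokens * input_usd_per_million
            + completion_tokens * output_usd_per_million) / 1_000_000


# Hypothetical prices (assumption): $2.00 per 1M input tokens, $8.00 per 1M output tokens.
cost = request_cost_usd(1_200, 350,
                        input_usd_per_million=2.00,
                        output_usd_per_million=8.00)
print(f"${cost:.6f}")  # $0.005200
```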

Rate limits

Streaming does not reduce your rate limit consumption. Each request, streamed or not, counts as one toward your request limits, and its tokens count toward your token limits. If you're streaming to reduce latency but making more concurrent requests, you may hit limits faster.

Common gotcha

Developers assume token usage arrives with the first chunk or at regular intervals. It doesn't. By default every chunk has usage=None, and with stream_options={"include_usage": True} only the final chunk carries it. If you're building a cost dashboard and only log the first chunk, you'll show 0 tokens. Always check that usage is not None before reading it, and read it only after the stream completes.

Error recovery

AttributeError: 'NoneType' object has no attribute 'prompt_tokens'
You're reading chunk.usage.prompt_tokens on a streaming chunk whose usage is None. Check that chunk.usage is not None first, use a non-streamed request for immediate usage, or consume the full stream and estimate with tiktoken.
TypeError: 'ChatCompletionChunk' object is not subscriptable
You're treating a chunk like a dict (chunk['usage']). Use dot notation: chunk.usage. Chunks are objects, not dicts.
usage is None after stream completes
The API includes usage in the final chunk only when the request sets stream_options={"include_usage": True}. If you can't set it, make a separate non-streamed request or use tiktoken to estimate from the full response text.

Experienced dev note

In production, use tiktoken to estimate tokens client-side during streaming, then log actual usage from a separate non-streamed request at a 5% sample rate (enough to catch drift without much extra cost). This gives you real-time user feedback while preserving cost visibility. Never rely solely on chunk.usage, which is None on most chunks.
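One way to pick the 5% sample is to hash a request identifier, so the same request always gets the same decision. This is a sketch under assumptions: in_reconciliation_sample is a hypothetical helper, and it presumes you already assign each request a stable string ID.

```python
import hashlib


def in_reconciliation_sample(request_id: str, rate: float = 0.05) -> bool:
    """Deterministically select about `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare it
    # with the matching fraction of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


sampled = sum(in_reconciliation_sample(f"req-{i}") for i in range(10_000))
print(sampled)  # close to 500 out of 10,000, varying with the IDs
```

Hash-based selection avoids shared random state across workers, and it lets you recompute later which requests were in the sample.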

Check your understanding

You're building a streaming chat endpoint where users pay per token. You stream the response for fast UX but also need accurate billing. Why can't you simply read chunk.usage from the stream and charge immediately? What would happen if you tried?

Answer hint

chunk.usage is None on almost all chunks. Even if it weren't, the final token count isn't known until the stream ends. You'd either charge nothing or charge incomplete amounts, breaking your billing model.

VERSION In openai 1.x, streaming chunks are ChatCompletionChunk objects. In older versions (pre-1.0), the streaming format was different. Always use the 1.x SDK pattern shown here.
