Streaming in chat sessions
Why this matters
Chat interfaces need to feel responsive. Streaming lets you display tokens as they arrive instead of waiting for the full response, dramatically improving perceived performance and user experience in conversational AI.
Explanation
What it does: Passing stream=True to a chat session's send_message call returns text as it is generated instead of after the full completion. You iterate over the response stream and display or process each chunk in real time.
How it works: When you call chat_session.send_message(prompt, stream=True), the API keeps the connection open and the SDK yields GenerateContentResponse chunks, each containing partial text. Each iteration gives you the next piece of text, usually several tokens at a time. The connection stays open until generation completes. Gemini still generates the same full response; streaming only changes how it is delivered to you.
When to use it: Always enable streaming for chat UIs, customer-facing assistants, or any scenario where users wait for responses. Disable it only when you need the full response atomically (structured parsing, conditional logic on the complete output, or testing).
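The delivery model described above can be imitated without any network call. The sketch below is only an illustration: FakeChunk, fake_stream, and consume are hypothetical names, and the generator stands in for the SDK's streamed response rather than reproducing it. It shows that consuming chunks one at a time ends with the same text a non-streaming call would return.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed chunk; only .text is modeled."""
    text: str


def fake_stream(full_text: str, chunk_size: int = 12) -> Iterator[FakeChunk]:
    """Yield the text in fixed-size slices, the way a stream delivers partial output."""
    for start in range(0, len(full_text), chunk_size):
        yield FakeChunk(text=full_text[start:start + chunk_size])


def consume(stream: Iterator[FakeChunk]) -> str:
    """Display each chunk as it arrives and return the accumulated text."""
    pieces = []
    for chunk in stream:
        if chunk.text:  # skip chunks that carry no text
            print(chunk.text, end="", flush=True)
            pieces.append(chunk.text)
    print()
    return "".join(pieces)


answer = "Entangled particles share a single quantum state."
streamed = consume(fake_stream(answer))
```

The same consume loop should work on a real streamed response, provided each chunk exposes a text attribute as in the request code that follows.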
Request code
```python
import os

import google.generativeai as genai

# Read the API key from the environment rather than hard-coding it.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

model = genai.GenerativeModel('gemini-2.0-flash')
chat_session = model.start_chat(history=[])

prompt = 'Explain quantum entanglement in three sentences.'

print('Streaming response:')
print('-' * 40)

# stream=True returns an iterable that yields chunks as they are generated.
response = chat_session.send_message(prompt, stream=True)

full_text = ''
for chunk in response:
    if chunk.text:  # skip chunks that carry no text
        print(chunk.text, end='', flush=True)
        full_text += chunk.text

print('\n' + '-' * 40)

# Usage metadata is complete only after the stream has been fully consumed.
usage = response.usage_metadata
print(f'\nTotal tokens in response: {usage.candidates_token_count}')
print(f'Total tokens in conversation: {usage.total_token_count}')
```
Authentication
Set your API key as an environment variable before running code:

```shell
export GOOGLE_API_KEY='your-api-key-here'
```

Then import and configure in Python:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
```
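If the variable is missing, os.environ['GOOGLE_API_KEY'] fails with a bare KeyError. One possible way to fail with a clearer message is sketched below; require_api_key is a hypothetical helper, not part of the SDK, and it takes the environment as a parameter so the example runs without a real key.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from env, or raise a clear error if it is unset or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script.")
    return key


# Usage with an explicit mapping, so the example runs without a real key.
demo_key = require_api_key({"GOOGLE_API_KEY": "demo-key"})
```

In a real script you would pass the returned key to genai.configure(api_key=...).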
Response shape
| Field | Description |
|---|---|
| chunk | GenerateContentResponse object yielded by the iterator |
| chunk.text | str: partial text for this iteration, may be empty |
| chunk.parts | list[Part]: lower-level part structure, rarely needed |
| response.usage_metadata | UsageMetadata object with prompt_token_count, candidates_token_count, total_token_count, cached_content_token_count |
| response.prompt_feedback | PromptFeedback: block reason and safety ratings for the original prompt |
Field guide
chunk.text: The human-readable text to display. Check that it is non-empty before appending, because some chunks carry only metadata.
response.usage_metadata.cached_content_token_count: Tokens served from a context cache (if one is in use). This field matters for cost optimization because cached tokens are billed at a discount, which this lesson takes as 10% of the regular input price; confirm the rate for your model on the pricing page. Developers often overlook the field and never notice how much of their input is being read at the cheaper rate.
response.usage_metadata.candidates_token_count: The output token count. Treat it as final only after the entire stream completes, and do not rely on it mid-stream.
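For cost attribution it can help to copy the counts into a plain dict once the stream has finished. The sketch below assumes the usage object exposes prompt_token_count, cached_content_token_count, and candidates_token_count, and that the prompt count includes any cached tokens; summarize_usage and the numbers are hypothetical, and a SimpleNamespace stands in for the real metadata object.

```python
from types import SimpleNamespace


def summarize_usage(usage) -> dict:
    """Copy token counts into a plain dict, treating missing fields as 0."""
    prompt = getattr(usage, "prompt_token_count", 0) or 0
    cached = getattr(usage, "cached_content_token_count", 0) or 0
    output = getattr(usage, "candidates_token_count", 0) or 0
    return {
        "prompt_tokens": prompt,
        "cached_tokens": cached,
        # Assumes the prompt count includes the cached tokens.
        "uncached_prompt_tokens": max(prompt - cached, 0),
        "output_tokens": output,
    }


# Hypothetical values standing in for response.usage_metadata after the stream ends.
usage = SimpleNamespace(prompt_token_count=1200,
                        cached_content_token_count=1000,
                        candidates_token_count=85)
summary = summarize_usage(usage)
```

Using getattr with a default keeps the helper from failing when a field is absent, at the price of silently reporting 0 for it.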
Setup trap
Treat the response stream as single-pass. A plain Python iterator yields nothing on a second loop, so code that iterates twice (once to display, once to parse) is fragile and ties you to SDK internals. Capture the text during the first iteration, or collect the chunks in a list before processing.
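The single-pass behavior is easy to see with an ordinary generator. This is a sketch of the general iterator pattern, not of the SDK's response class, which may behave differently; chunk_texts is a hypothetical stand-in for a stream of text pieces.

```python
def chunk_texts():
    """Hypothetical stand-in for a streamed response: a one-shot generator of text pieces."""
    yield from ["Quantum ", "entanglement ", "links particles."]


stream = chunk_texts()
first_pass = [text for text in stream]   # consumes the generator
second_pass = [text for text in stream]  # nothing left to yield

# Safer pattern: buffer once, then reuse the list as often as needed.
buffered = list(chunk_texts())
display_text = "".join(buffered)
word_count = len("".join(buffered).split())
```

Buffering costs memory proportional to the response size, which is rarely a concern for chat-length output.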
Cost
Streaming costs the same per token as non-streaming. Combined with prompt caching, it gets cheaper: the first request that establishes a cached context pays the full input price, and subsequent requests in the same chat read that context at 10% of the regular input price. A cached 100k-token context is then billed like roughly 10k regular input tokens per read instead of 100k.
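The arithmetic can be checked with a few lines. This sketch uses the 10% cached rate and the 100k-token context from the text; billed_input_equivalent and the 50-token question size are hypothetical, and actual billing rules (including any cache storage charges) are not modeled.

```python
def billed_input_equivalent(cached_tokens: int, fresh_tokens: int,
                            cached_rate_percent: int = 10) -> int:
    """Input cost of one request, expressed in full-price token equivalents."""
    return fresh_tokens + cached_tokens * cached_rate_percent // 100


context = 100_000   # cached context from the text's example
question = 50       # hypothetical size of each new user message

# First request: the context is not yet read from cache, so it is billed in full.
first_request = context + question
# Later requests: the context is read at the cached rate.
later_request = billed_input_equivalent(context, question)

print(first_request, later_request)  # prints: 100050 10050
```

Integer percent arithmetic is used here to avoid floating-point rounding in the token counts.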
Rate limits
Streaming counts the same toward rate limits as non-streaming (tokens per minute). However, the perceived latency improvement means users are less likely to retry, reducing duplicate requests and actual rate limit hits.
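When a rate limit is hit anyway, retrying with exponential backoff is a common response. The following is a minimal sketch under stated assumptions: RateLimitError is a hypothetical stand-in for whatever 429 exception your client library raises, with_backoff is not an SDK function, and the sleep function is injectable so the demo runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever 429 error your client library raises."""


def with_backoff(call: Callable[[], T], max_attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on RateLimitError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo: fail twice, then succeed; delays are recorded instead of slept.
attempts = {"count": 0}
delays = []


def flaky_send() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429")
    return "ok"


result = with_backoff(flaky_send, sleep=delays.append)
```

For a streamed call, wrap the whole send-and-consume step in the retried function, so that a retry restarts the stream instead of resuming a broken one.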
Common gotcha
A common mistake is accessing response.text or the final usage metadata before the stream completes. The response object is not fully populated until you have consumed the entire iterator. Iterate through all chunks first (or call response.resolve() to drain the stream), then read the aggregated text and metadata.
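The ordering rule can be modeled with a toy class. FakeStreamedResponse below is hypothetical and only imitates the idea that totals exist after iteration; the real SDK object has a different interface and raises its own error type.

```python
class FakeStreamedResponse:
    """Hypothetical model of a streamed response whose totals exist only after iteration."""

    def __init__(self, pieces):
        self._pieces = list(pieces)
        self._done = False

    def __iter__(self):
        for piece in self._pieces:
            yield piece
        self._done = True  # reached only when the loop runs to the end

    @property
    def output_token_count(self):
        if not self._done:
            raise RuntimeError("stream not fully consumed yet")
        return len(self._pieces)  # pretend each piece is one token


def drain(response):
    """Consume every chunk first, then read the aggregated metadata."""
    text = "".join(piece for piece in response)
    return text, response.output_token_count


early = FakeStreamedResponse(["Hel", "lo"])
try:
    early.output_token_count
    premature_access_failed = False
except RuntimeError:
    premature_access_failed = True

text, tokens = drain(FakeStreamedResponse(["Hel", "lo"]))
```

Keeping the "consume, then read" steps inside one function like drain makes it hard to read the metadata too early by accident.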
Error recovery
StopIteration
AttributeError: 'NoneType' has no attribute 'text'
APIError with 429 status
APIConnectionError mid-stream
Experienced dev note
Streaming plus prompt caching is a hidden multiplier. If you send the same system prompt or context repeatedly (documentation, a codebase, RAG context), cache it once, for example with caching.CachedContent.create(...) from google.generativeai, and build the model from that cache with genai.GenerativeModel.from_cached_content(...). You pay full price on the first message, and subsequent messages read that context at 10% cost. For large RAG workloads, this is the difference between 10¢ and $1 per query. Also, response.usage_metadata reflects only the final state after the stream ends, so use it for accurate cost attribution, not for mid-stream token counting.
Check your understanding
If you iterate over a streamed chat response, accumulate text in a variable, then send a follow-up message to the same chat session, why might the second response be cheaper (in tokens) than the first, and what field would tell you this is happening?
Answer hint
If prompt caching is active for the session, earlier turns do not have to be billed at full price again. On the second message, the first exchange and the system context can be served from cache, so response.usage_metadata.cached_content_token_count will be non-zero, and that share of the prompt is billed at the cached rate, lowering cost.
stream=True parameter. Older 0.1.x versions used response_type=ResponseType.STREAMING. Set a minimum version of google-generativeai>=0.8.0 in requirements.txt. Streaming is supported on gemini-2.0-flash, gemini-2.5-pro, and gemini-1.5-pro.