API Intermediate medium · 6 min

Streaming in chat sessions

What you will learn
Stream token-by-token responses in multi-turn conversations to reduce perceived latency and improve UX.

Why this matters

Chat interfaces need to feel responsive. Streaming lets you display tokens as they arrive instead of waiting for the full response, dramatically improving perceived performance and user experience in conversational AI.

Skip if: Use non-streaming responses when you need the complete final answer before processing (e.g., parsing structured output, validation logic, or when the overhead of managing a stream isn't worth the latency savings for short responses).

Explanation

What it does: The stream=True parameter on chat sessions returns tokens as they're generated, rather than waiting for the full completion. You iterate over the response stream and display or process each chunk in real time.

How it works: When you call chat_session.send_message(prompt, stream=True), the API opens a persistent connection and begins yielding GenerateContentResponse chunks containing partial text. Each iteration gives you the next token or group of tokens. The connection stays open until generation completes. Internally, Gemini is still generating the full response: streaming just changes how it's delivered to you.

When to use it: Always enable streaming for chat UIs, customer-facing assistants, or any scenario where users wait for responses. Disable it only when you need the full response atomically (structured parsing, conditional logic on the complete output, or testing).
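
The latency benefit described above can be illustrated without calling the API. The following is a minimal sketch, not SDK code: FakeChunk, fake_stream, and consume are hypothetical names, and the sleep only imitates generation time. It measures how long the first chunk takes to arrive compared with the whole response.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed chunk; only .text is modeled."""
    text: str


def fake_stream(pieces: Iterable[str], delay_s: float = 0.01) -> Iterator[FakeChunk]:
    """Yield chunks one at a time, sleeping to imitate generation time."""
    for piece in pieces:
        time.sleep(delay_s)
        yield FakeChunk(piece)


def consume(stream: Iterable[FakeChunk]) -> Tuple[str, float, float]:
    """Return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    parts: List[str] = []
    for chunk in stream:
        if first_chunk_at is None:
            # This is the moment a UI could start rendering text.
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk.text)
    total = time.perf_counter() - start
    if first_chunk_at is None:  # empty stream: nothing ever arrived
        first_chunk_at = total
    return "".join(parts), first_chunk_at, total
```

With a real chat session, the same loop shape applies: the user sees output after the first chunk rather than after the total generation time.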

Request code

python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')
chat_session = model.start_chat(history=[])

prompt = 'Explain quantum entanglement in three sentences.'
print('Streaming response:')
print('-' * 40)

response = chat_session.send_message(prompt, stream=True)

full_text = ''
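for chunk in response:
    try:
        text = chunk.text  # raises ValueError if the chunk carries no text part
    except ValueError:
        continue  # skip chunks without displayable text
    print(text, end='', flush=True)
    full_text += text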

print('\n' + '-' * 40)
# Final token counts are available once the stream has been fully consumed.
usage = response.usage_metadata
print(f'\nOutput tokens in response: {usage.candidates_token_count}')
print(f'Total tokens for this request: {usage.total_token_count}')

Authentication

Set your API key as an environment variable before running the code:

export GOOGLE_API_KEY='your-api-key-here'

Then import and configure in Python:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

Response shape

Field | Description
chunk | GenerateContentResponse object yielded by the iterator
chunk.text | str: partial text for this iteration; raises ValueError if the chunk has no text part
chunk.parts | list[Part]: lower-level part structure, rarely needed
response.usage_metadata | UsageMetadata object with prompt_token_count, candidates_token_count, total_token_count, cached_content_token_count
response.prompt_feedback | PromptFeedback: block reason and safety ratings for the original prompt

Field guide

chunk.text

The actual human-readable text to display. Some chunks carry no text part (for example, a chunk that only reports a finish reason), and reading chunk.text on those raises ValueError, so guard the access before appending.

response.usage_metadata.cached_content_token_count

Tokens served from cached content (if caching is in use). Critical for cost optimization: cached tokens are billed at a discounted rate, taken in this lesson to be about 10% of regular input tokens (check current pricing for your model). Developers often miss this field and don't realize how much of the prompt is already being read at the cheaper rate.

response.usage_metadata.candidates_token_count

Final only after the entire stream completes. Do not rely on this mid-stream.

Setup trap

Treat the response stream as single-pass. A second loop over an exhausted iterator yields nothing, so if you need the text twice (once to display, once to parse), capture it during the first iteration or collect the chunks in a list before processing.
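
The single-pass behavior can be shown with a plain Python generator standing in for the streamed response. This is a sketch under that assumption: fake_stream and display_and_collect are hypothetical helpers, not SDK functions. The text is displayed and collected in one pass, then parsed afterwards.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streamed response: a one-shot generator."""
    yield from ['{"answer": ', '42', '}']


def display_and_collect(stream: Iterable[str]) -> Tuple[List[str], str]:
    """Print each piece as it arrives and keep it for later processing."""
    pieces: List[str] = []
    for piece in stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return pieces, "".join(pieces)


stream = fake_stream()
pieces, full_text = display_and_collect(stream)

# A second pass over the same generator finds nothing left to read.
leftover = list(stream)

# Parsing happens on the text captured during the first pass.
parsed = json.loads(full_text)
```

Collecting into a list costs a little memory but lets display and parsing share one pass over the stream.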

Cost

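Streaming costs the same per token as non-streaming. Streaming pairs well with context caching, though: if your chat reuses a cached context, you pay the regular input price (plus any cache storage charge) when the cache is created, and subsequent requests that read from the cache are billed at the discounted cached rate, taken here to be about 10% of the regular input price (check current pricing for your model). At that rate, reading a cached 100k-token context is billed like roughly 10k regular input tokens instead of 100k.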
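
The arithmetic above can be written out as a small helper. This is a sketch only: billed_input_equivalent is a hypothetical function, and the 0.10 default is this lesson's assumed discount, not a quoted price.

```python
def billed_input_equivalent(
    cached_tokens: int,
    fresh_tokens: int,
    cached_rate: float = 0.10,  # assumed fraction of the regular input price
) -> float:
    """Express a request's input cost in units of regular input tokens.

    Cached tokens are weighted by cached_rate; fresh (uncached) tokens
    count at full weight.
    """
    if not 0.0 <= cached_rate <= 1.0:
        raise ValueError("cached_rate must be between 0 and 1")
    return cached_tokens * cached_rate + fresh_tokens


# A 100k-token cached context plus a 500-token new user message.
with_cache = billed_input_equivalent(100_000, 500)
without_cache = billed_input_equivalent(0, 100_500)
```

Storage charges for keeping a cache alive are not modeled here, so the comparison favors caching most when the same context is read many times within the cache lifetime.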

Rate limits

Streaming counts the same toward rate limits as non-streaming (tokens per minute). However, the perceived latency improvement means users are less likely to retry, reducing duplicate requests and actual rate limit hits.

Common gotcha

Trying to access response.text or final usage metadata before the stream completes. The response object is not fully populated until you've consumed the entire iterator, and the SDK raises IncompleteIterationError if you read the accumulated attributes early. Always iterate through all chunks first (or call response.resolve()), then access aggregated metadata.

Error recovery

StopIteration
Iterator exhausted before expected. Cause: advancing a stream iterator that has already been consumed. Fix: collect all chunks in a list on the first pass, or read the completed reply from the last entry of chat_session.history.
ValueError when reading chunk.text
The chunk has no text part. Cause: a streaming chunk without a text payload, such as one that only carries a finish reason or was blocked. Fix: wrap the chunk.text access in try/except ValueError and inspect the chunk's candidates if you need the reason.
ResourceExhausted (HTTP 429)
Rate limited. Cause: too many concurrent streams or requests in a short time. Fix: add exponential backoff with jitter between requests, or batch messages. The exception class is google.api_core.exceptions.ResourceExhausted.
Connection error mid-stream (for example ServiceUnavailable or DeadlineExceeded from google.api_core.exceptions)
Connection dropped during streaming. Cause: network timeout, server restart, or client timeout. Fix: wrap the loop in try/except, keep the full_text accumulated so far, and resend from the last confirmed message.
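
The backoff and partial-text advice above can be sketched as a small retry wrapper. Everything here is illustrative: stream_with_retry, start_stream, and on_partial are hypothetical names, and the caller chooses which exception types count as retryable (with the real SDK those would likely be classes from google.api_core.exceptions).

```python
import random
import time
from typing import Callable, Iterable, List, Tuple, Type


def stream_with_retry(
    start_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    on_partial: Callable[[str], None] = lambda text: None,
) -> str:
    """Consume a text stream, restarting it with backoff if it fails.

    start_stream is called once per attempt and must return a fresh stream.
    on_partial receives whatever text arrived before a failure, so the
    caller can log or display it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        received: List[str] = []
        try:
            for piece in start_stream():
                received.append(piece)
            return "".join(received)
        except retry_on:
            on_partial("".join(received))
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base * 2^attempt) plus random jitter.
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, base_delay_s)
            sleep(delay)
    raise RuntimeError("unreachable")
```

Each retry restarts the stream from the beginning rather than resuming mid-response. With a real chat session, a failed turn may also need to be removed from the session history before resending, which this sketch does not handle.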

Experienced dev note

Streaming + prompt caching is a hidden multiplier. If you're sending the same system prompt or context repeatedly (documentation, codebase, RAG context), cache it once with CachedContent.create(...) from google.generativeai.caching and build the model from it with genai.GenerativeModel.from_cached_content(...). You pay full price to create the cache, but subsequent messages read that context at the discounted cached rate (about 10% by this lesson's figure). For large RAG workloads, that can be the difference between 10¢ and $1 per query. Also: response.usage_metadata only reflects the final state after the stream ends, so use it for accurate cost attribution, not for mid-stream token counting.
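
A hedged sketch of the explicit caching setup follows. It assumes google-generativeai 0.8.x is installed, that genai.configure(...) has already been called, and that the chosen model supports context caching. start_cached_chat, the system instruction text, and the default model name are illustrative choices, not values from this lesson.

```python
import datetime


def start_cached_chat(
    context_text: str,
    model_name: str = "models/gemini-1.5-flash-001",  # assumed caching-capable model
    ttl_minutes: int = 10,
):
    """Create a context cache and return a chat session that reads from it."""
    # Imported lazily so this module loads even where the SDK is absent.
    import google.generativeai as genai
    from google.generativeai import caching

    cache = caching.CachedContent.create(
        model=model_name,
        system_instruction="Answer using only the provided context.",
        contents=[context_text],
        ttl=datetime.timedelta(minutes=ttl_minutes),
    )
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    return model.start_chat()
```

A session returned this way is used like any other: call send_message(prompt, stream=True), consume the stream, then read response.usage_metadata.cached_content_token_count to confirm the cache was hit. Caches have a minimum size, so short contexts may be rejected at creation time.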

Check your understanding

If you iterate over a streamed chat response, accumulate text in a variable, then send a follow-up message to the same chat session, why might the second response be cheaper (in tokens) than the first, and what field would tell you this is happening?

Answer hint

The chat session resends the full history with every message, so the second request repeats the first message and any system context. When that repeated prefix is served from cache (explicit context caching, or implicit caching on models that support it), response.usage_metadata.cached_content_token_count will be non-zero, and those tokens are billed at the discounted cached rate, lowering cost.

VERSION google-generativeai 0.8.x uses the stream=True parameter. Require google-generativeai>=0.8.0 in requirements.txt. Streaming support is stable across gemini-2.0-flash, gemini-2.5-pro, and gemini-1.5-pro.
