Iterating over response chunks
Why this matters
Streaming responses reduce perceived latency in user-facing applications: users see text appear in real time rather than waiting for a complete response. This is essential for building responsive chatbots and real-time content generation UIs.
Explanation
What this does: Passing stream=True to generate_content() makes the Gemini API return the response as it is generated, one chunk at a time, instead of waiting for the entire response. Each chunk arrives as a separate object you can process as soon as it is received.
How it works: When you set stream=True, generate_content() returns an iterator. The API sends tokens to your client as the model generates them. Your code processes each chunk in a for loop, allowing you to display partial results or accumulate the full response piece by piece.
When to use it: Use streaming for interactive applications (chatbots, content writers, code generators) where users benefit from seeing results appear progressively. Avoid streaming for batch jobs, APIs that expect complete responses, or when network reliability is low.
Request code
import os

import google.generativeai as genai

# Read the API key from the environment and configure the SDK.
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

prompt = 'Write a short poem about Python programming'
# stream=True returns chunks as the model generates them.
response = model.generate_content(prompt, stream=True)

full_text = ''
for chunk in response:
    if chunk.text:
        print(chunk.text, end='', flush=True)  # show partial output immediately
        full_text += chunk.text                # keep the complete response
print()  # newline after streaming completes
print(f'\nTotal characters received: {len(full_text)}')
Authentication
Set your Google API key before making requests. Export it as an environment variable: export GOOGLE_API_KEY='your-key-here'. The example then reads it with os.environ['GOOGLE_API_KEY'] and passes it to genai.configure(api_key=...).
Response shape
| Field | Description |
|---|---|
| chunk | A GenerateContentResponse object with partial model output |
| chunk.text | String containing the text generated in this chunk (may be empty) |
| chunk.candidates[0].finish_reason | Enum value indicating why generation stopped (e.g., STOP, MAX_TOKENS); only set on the final chunk |
| chunk.usage_metadata | Token usage data (input/output tokens); only present on the final chunk |
Field guide
chunk.text The actual text content generated. Check if non-empty before processing, as some chunks may contain only metadata.
chunk.candidates[0].finish_reason An often overlooked field that tells you why streaming ended, which is crucial for detecting truncation or errors. It is only set on the last chunk, not on every iteration.
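As a minimal sketch of the truncation check described above, the helper below consumes a stream, accumulates the text, and remembers the last finish reason it saw. So that it runs without network access, it uses stand-in chunk objects built with SimpleNamespace rather than real SDK responses. The names finish_reason_of, consume_stream, and fake_stream are hypothetical, and the lookup order (candidates[0].finish_reason first, then a top-level finish_reason) is an assumption meant to tolerate differences between SDK versions.

```python
from types import SimpleNamespace


def finish_reason_of(chunk):
    """Return the chunk's finish reason as a string, or None if it has none."""
    candidates = getattr(chunk, "candidates", None) or []
    if candidates:
        reason = getattr(candidates[0], "finish_reason", None)
    else:
        reason = getattr(chunk, "finish_reason", None)
    if reason is None:
        return None
    # Enum members expose .name; plain strings are used as-is.
    name = getattr(reason, "name", str(reason))
    # Assumption: an "unspecified" value means generation is still in progress.
    return None if name == "FINISH_REASON_UNSPECIFIED" else name


def consume_stream(chunks):
    """Accumulate text and keep the last finish reason seen in the stream."""
    parts, last_reason = [], None
    for chunk in chunks:
        text = getattr(chunk, "text", "")
        if text:
            parts.append(text)
        reason = finish_reason_of(chunk)
        if reason is not None:
            last_reason = reason
    return "".join(parts), last_reason


def fake_stream():
    """Stand-in stream: two chunks, the second one cut off by the token limit."""
    yield SimpleNamespace(text="Roses are ", candidates=[SimpleNamespace(finish_reason=None)])
    yield SimpleNamespace(text="red", candidates=[SimpleNamespace(finish_reason="MAX_TOKENS")])


text, reason = consume_stream(fake_stream())
truncated = reason == "MAX_TOKENS"  # True here: the output may end mid-sentence
```

With a real response, the same consume_stream(response) call would replace the fake stream, provided the chunk attributes match the assumptions above.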
Setup trap
The GOOGLE_API_KEY environment variable must be set before the script calls genai.configure(); os.environ['GOOGLE_API_KEY'] raises KeyError if it is missing. Setting it mid-script after instantiating the model will not work. Before configuring, verify that os.environ.get('GOOGLE_API_KEY') returns a value, and avoid printing the key itself to logs.
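One way to catch this trap early is to fail with a clear message before configuring the SDK. The sketch below is illustrative only: require_api_key is a hypothetical helper, not part of the SDK, and it checks for the key without ever printing it.

```python
import os


def require_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the key from the environment, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script")
    return key


# Intended use with the real environment (commented out so the sketch runs anywhere):
# genai.configure(api_key=require_api_key())

# Demonstration with a stand-in mapping instead of the real environment.
demo_key = require_api_key({"GOOGLE_API_KEY": "demo-value"})
```

Passing a mapping makes the check easy to test; in a script you would call require_api_key() with no arguments so it reads os.environ.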
Cost
Streaming requests are billed identically to non-streaming requests: you pay per input and output token, regardless of how you consume the response. Streaming is purely a UX optimization, not a cost-saving technique.
Rate limits
Streaming requests count against your concurrent request limit. If you iterate slowly over chunks, you hold an open connection longer, consuming your concurrency quota. Keep iteration fast or implement backpressure in production.
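One possible way to keep iteration fast when rendering is slow is to read the stream on a background thread and hand chunks to the consumer through a queue. The sketch below uses only the standard library and a plain iterator as a stand-in for the response; drain_to_queue and buffered are hypothetical names. Whether draining early actually frees quota depends on how the service counts open requests, so treat this as an assumption to verify.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def drain_to_queue(chunks, out):
    """Read chunks as fast as they arrive and push them onto the queue."""
    try:
        for chunk in chunks:
            out.put(chunk)
    finally:
        # Always signal completion, even if reading the stream failed.
        out.put(_DONE)


def buffered(chunks, maxsize=0):
    """Yield chunks that a background thread has already read from the stream.

    maxsize=0 means an unbounded buffer: the reader never waits for the consumer.
    A positive maxsize bounds memory and makes the reader wait (backpressure).
    """
    out = queue.Queue(maxsize=maxsize)
    threading.Thread(target=drain_to_queue, args=(chunks, out), daemon=True).start()
    while True:
        item = out.get()
        if item is _DONE:
            return
        yield item


# Stand-in stream; a real streaming response would be passed in the same way.
received = list(buffered(iter(["a", "b", "c"])))
bounded = list(buffered(iter(range(5)), maxsize=1))
```

A limitation of this sketch: an exception raised while reading the stream ends the background thread silently, so production code would need to forward errors to the consumer as well.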
Common gotcha
Accessing response.text on a streaming response before iteration has finished raises an error. Iterate over the chunks first and accumulate the text as you go. Treat the stream as single-pass: do not rely on looping over the same response object twice.
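If you need more than one pass over the output, a simple option is to buffer the chunks into a list during the first pass. The sketch below illustrates the single-pass behavior with an ordinary Python generator standing in for the streaming response (fake_stream is a hypothetical name); the real response object may behave differently, so this shows the defensive pattern rather than the SDK's internals.

```python
def fake_stream():
    """Stand-in for a streaming response: a generator can be consumed only once."""
    yield from ["Hello", ", ", "world"]


stream = fake_stream()
chunks = list(stream)                    # the one and only pass over the stream
full_text = "".join(chunks)              # first pass over the buffered list
lengths = [len(c) for c in chunks]       # second pass works because chunks is a list
leftover = list(stream)                  # the generator itself is now exhausted
```

Buffering with list() waits for the whole stream, so it gives up progressive display; to keep both, append each chunk to a list inside the display loop instead.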
Error recovery
StopIteration or AttributeError on iteration
APIError during streaming
google.api_core.exceptions.InvalidArgument
Experienced dev note
Streaming reduces perceived latency but introduces state management complexity. Always accumulate chunks into a variable (like full_text) in case you need the complete response for logging, validation, or retry logic. Also check the finish reason (chunk.candidates[0].finish_reason in this SDK) before assuming the response is complete: STOP means the model finished naturally; MAX_TOKENS means it hit the token limit and may have cut off mid-sentence.
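The retry logic mentioned above could look roughly like the following sketch, which restarts a stream from scratch when it fails partway through. Everything here is illustrative: stream_with_retry and flaky_stream are hypothetical names, the stream is modeled as an iterable of text pieces, and ConnectionError is only a placeholder for whichever exception classes you decide are worth retrying.

```python
import time


def stream_with_retry(start_stream, on_text, max_attempts=3, base_delay=0.0,
                      retriable=(ConnectionError,)):
    """Run a stream to completion, restarting from scratch on retriable errors.

    start_stream: zero-argument callable returning a fresh iterable of text pieces.
    on_text: callback invoked with each piece as it arrives.
    """
    for attempt in range(1, max_attempts + 1):
        parts = []  # partial text from a failed attempt is discarded
        try:
            for piece in start_stream():
                parts.append(piece)
                on_text(piece)
            return "".join(parts)
        except retriable:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_stream():
    """Stand-in stream that drops the connection on the first attempt only."""
    calls["n"] += 1
    yield "partial "
    if calls["n"] == 1:
        raise ConnectionError("simulated drop")
    yield "answer"


shown = []
result = stream_with_retry(flaky_stream, shown.append)
```

Note that on_text has already received the partial output of the failed attempt, so a real UI would need to clear what it displayed before the retry starts.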
Check your understanding
If your streaming loop finishes but you realize you need the full response's usage_metadata (token counts), what should you have done differently?
Answer hint
usage_metadata only appears on the final chunk. Save it during iteration, or read response.usage_metadata once the loop has completed. Because you should plan to iterate only once, capturing the metadata during the loop is the safer approach.
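A minimal sketch of capturing the metadata during the loop is shown below. It uses SimpleNamespace stand-ins for chunks so it runs offline; stream_and_capture_usage and fake_stream are hypothetical names, the token counts are made-up illustrative values, and the check for None assumes that chunks without usage data expose usage_metadata as None (real chunks may instead carry an empty object).

```python
from types import SimpleNamespace


def stream_and_capture_usage(chunks):
    """Return (full_text, usage), where usage is the last usage_metadata seen."""
    parts, usage = [], None
    for chunk in chunks:
        text = getattr(chunk, "text", "")
        if text:
            parts.append(text)
        meta = getattr(chunk, "usage_metadata", None)
        if meta is not None:
            usage = meta  # keep the most recent value; the final chunk wins
    return "".join(parts), usage


def fake_stream():
    """Stand-in stream: only the final chunk carries usage data."""
    yield SimpleNamespace(text="Hi ", usage_metadata=None)
    yield SimpleNamespace(
        text="there",
        usage_metadata=SimpleNamespace(
            prompt_token_count=4, candidates_token_count=3, total_token_count=7
        ),
    )


text, usage = stream_and_capture_usage(fake_stream())
```

After the call, usage.total_token_count is available even though the stream itself has been fully consumed.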