
Iterating over response chunks

What you will learn

Stream Gemini API responses chunk-by-chunk instead of waiting for the full response to arrive.

Why this matters

Streaming reduces perceived latency in user-facing applications: users see text appear in real time rather than waiting for a complete response. This matters most for responsive chatbots and real-time content generation UIs.

Skip if: Use non-streaming responses when you need the complete, validated output before processing (e.g., structured extraction, batch processing, or when latency is not a user-facing concern). Streaming adds complexity and makes error handling harder.

Explanation

What this does: The Gemini API's stream=True parameter returns responses as they are generated, one chunk at a time, instead of waiting for the entire response. Each chunk arrives as a separate object you can iterate over immediately.

How it works: When you set stream=True, generate_content() returns an iterator. The API sends tokens to your client as the model generates them. Your code processes each chunk in a for loop, allowing you to display partial results or accumulate the full response piece by piece.

When to use it: Use streaming for interactive applications (chatbots, content writers, code generators) where users benefit from seeing results appear progressively. Avoid streaming for batch jobs, APIs that expect complete responses, or when network reliability is low.

Request code

python

import google.generativeai as genai
import os

genai.configure(api_key=os.environ['GOOGLE_API_KEY'])
model = genai.GenerativeModel('gemini-2.0-flash')

prompt = 'Write a short poem about Python programming'

response = model.generate_content(prompt, stream=True)

full_text = ''
for chunk in response:
    if chunk.text:
        print(chunk.text, end='', flush=True)
        full_text += chunk.text

print()  # newline after streaming completes
print(f'\nTotal characters received: {len(full_text)}')

Authentication

Set your Google API key before making requests. Export it as an environment variable: export GOOGLE_API_KEY='your-key-here'. The example then reads the variable and passes it explicitly via genai.configure(api_key=os.environ['GOOGLE_API_KEY']).

Response shape

Field	Description
`chunk`	A GenerateContentResponse object with partial model output
`chunk.text`	String containing the text generated in this chunk (may be empty)
`chunk.candidates[0].finish_reason`	Enum value indicating why generation stopped (e.g., STOP, MAX_TOKENS); only set on the final chunk
`chunk.usage_metadata`	Token usage data (input/output tokens); only present on the final chunk

Field guide

chunk.text

The actual text content generated. Check if non-empty before processing, as some chunks may contain only metadata.

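chunk.candidates[0].finish_reason

An often-overlooked field that tells you why streaming ended, which is crucial for detecting truncation or errors. It is read from the candidate rather than from the chunk itself, and it is only set on the last chunk, not on every iteration.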
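As a rough illustration of tracking the finish reason and usage data while iterating, the sketch below runs against stand-in chunk objects rather than a live API call. The helper names (finish_reason_of, consume_stream) are hypothetical, and the assumption that the reason sits on the first candidate (with a top-level attribute as a fallback) should be checked against your installed SDK.

```python
from types import SimpleNamespace


def finish_reason_of(chunk):
    """Return the chunk's finish reason, or None if it has none.

    Assumption: the reason lives on the first candidate; a top-level
    attribute is checked only as a fallback.
    """
    candidates = getattr(chunk, "candidates", None) or []
    if candidates:
        return getattr(candidates[0], "finish_reason", None)
    return getattr(chunk, "finish_reason", None)


def consume_stream(chunks):
    """Accumulate text and remember the last finish reason and usage data seen."""
    pieces = []
    finish_reason = None
    usage = None
    for chunk in chunks:
        text = getattr(chunk, "text", "")
        if text:
            pieces.append(text)
        # Later chunks overwrite earlier values, so the final chunk wins.
        finish_reason = finish_reason_of(chunk) or finish_reason
        usage = getattr(chunk, "usage_metadata", None) or usage
    return "".join(pieces), finish_reason, usage


# Stand-in chunks for illustration only; a real stream comes from the SDK.
fake_chunks = [
    SimpleNamespace(
        text="Hello, ",
        candidates=[SimpleNamespace(finish_reason=None)],
        usage_metadata=None,
    ),
    SimpleNamespace(
        text="world",
        candidates=[SimpleNamespace(finish_reason="MAX_TOKENS")],
        usage_metadata=SimpleNamespace(total_token_count=12),
    ),
]

text, reason, usage = consume_stream(fake_chunks)
# Comparing by name suffix works for plain strings and for enum-like values.
truncated = str(reason).endswith("MAX_TOKENS")
```

With a real stream, you would pass the object returned by generate_content(prompt, stream=True) to consume_stream in place of fake_chunks.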

Setup trap

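The GOOGLE_API_KEY environment variable must be set before the script reads it in genai.configure(). If it is missing at that point, os.environ['GOOGLE_API_KEY'] raises KeyError, and setting it later in the script does not change a configuration that has already been applied. To verify it without leaking the secret into logs, check 'GOOGLE_API_KEY' in os.environ rather than printing the value.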
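One way to fail early with a readable message is a small guard that runs before configuration. This is a minimal sketch: require_api_key and masked are hypothetical helper names, not part of the SDK.

```python
import os


def require_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the key from the environment, or raise a clear error if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script")
    return key


def masked(key):
    """Hide the key for logging; show the last four characters only for longer keys."""
    visible = key[-4:] if len(key) > 8 else ""
    return "*" * (len(key) - len(visible)) + visible
```

A script could then call genai.configure(api_key=require_api_key()) and log masked(key) instead of the key itself.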

Cost

Streaming requests are billed identically to non-streaming requests: you pay per input and output token, regardless of how you consume the response. Streaming is purely a UX optimization, not a cost-saving technique.

Rate limits

Streaming requests count against your concurrent request limit. If you iterate slowly over chunks, you hold an open connection longer, consuming your concurrency quota. Keep iteration fast or implement backpressure in production.
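To keep iteration fast when rendering is slow, one option is to read chunks in a background thread and hand them to the UI through a queue. The sketch below shows that decoupling (it is not true backpressure, since the buffer is unbounded); drain and render_slowly are hypothetical names, and the chunks are stand-ins for a real stream.

```python
import queue
import threading
import time
from types import SimpleNamespace

_DONE = object()  # sentinel marking the end of the stream


def drain(chunks, buffer):
    """Read chunks as fast as they arrive and hand their text to the buffer."""
    try:
        for chunk in chunks:
            text = getattr(chunk, "text", "")
            if text:
                buffer.put(text)
    finally:
        buffer.put(_DONE)  # always signal completion, even if reading fails


def render_slowly(chunks, render, delay=0.0):
    """Consume the stream in a background thread and render at our own pace."""
    buffer = queue.Queue()  # unbounded, so the reader never waits on the UI
    reader = threading.Thread(target=drain, args=(chunks, buffer), daemon=True)
    reader.start()
    while True:
        item = buffer.get()
        if item is _DONE:
            break
        render(item)
        time.sleep(delay)  # stands in for slow UI work
    reader.join()


# Stand-in chunks for illustration only.
fake_chunks = [SimpleNamespace(text=t) for t in ("one ", "", "two")]
rendered = []
render_slowly(fake_chunks, rendered.append)
```

An error raised while reading ends the reader thread and the render loop stops at the sentinel, so production code would also want to pass the exception back to the caller.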

Common gotcha

Accessing response.text on a streaming response before the stream has been fully consumed raises an error. Iterate over the chunks and accumulate the text yourself. Treat the stream as single-pass: keep what you need as you go rather than relying on looping over the same response object a second time.
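If you need to read the text more than once, a thin wrapper that caches each piece as it passes avoids touching the response object again. This is a sketch under the assumption that each chunk exposes a text attribute; BufferedStream is a hypothetical class, not an SDK type.

```python
from types import SimpleNamespace


class BufferedStream:
    """Wrap a one-shot chunk iterable and cache each text piece as it passes."""

    def __init__(self, chunks):
        self._source = iter(chunks)
        self._pieces = []

    def __iter__(self):
        # Replay what was already seen, then continue with the live source.
        for piece in list(self._pieces):
            yield piece
        for chunk in self._source:
            text = getattr(chunk, "text", "")
            if text:
                self._pieces.append(text)
                yield text

    @property
    def text(self):
        """Drain anything still pending and return the full text."""
        for _ in self:
            pass
        return "".join(self._pieces)


# A generator is used here because, like a live stream, it can be read only once.
one_shot = (SimpleNamespace(text=t) for t in ("Hello, ", "", "world"))
stream = BufferedStream(one_shot)

first_pass = list(stream)
second_pass = list(stream)  # served from the cache
full_text = stream.text
```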

Error recovery

StopIteration or AttributeError on iteration

You accessed response.text before the stream was consumed, or the model returned no content. Iterate over the chunks first, and accumulate only when chunk.text is truthy.

APIError during streaming

A network interruption occurred mid-stream. Wrap the loop in try/except and decide how to handle the partial response: the iterator stops, and you keep whatever text arrived before the failure.
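The pattern of keeping partial output when the stream fails can be sketched as follows. The failing stream is simulated, collect_with_recovery is a hypothetical helper, and the broad except Exception is a placeholder that real code should narrow to the error types your SDK version actually raises.

```python
from types import SimpleNamespace


def collect_with_recovery(chunks, on_text=None):
    """Return (text, error); error is None when the stream ended normally."""
    pieces = []
    try:
        for chunk in chunks:
            text = getattr(chunk, "text", "")
            if text:
                pieces.append(text)
                if on_text:
                    on_text(text)
    except Exception as exc:  # assumption: narrow this to SDK error types in real code
        return "".join(pieces), exc
    return "".join(pieces), None


def flaky_stream():
    """Stand-in stream that fails after two chunks (illustration only)."""
    yield SimpleNamespace(text="Partial ")
    yield SimpleNamespace(text="answer")
    raise ConnectionError("simulated network drop")


text, error = collect_with_recovery(flaky_stream())
if error is not None:
    # The caller decides: show the partial text, retry, or report the failure.
    status = f"interrupted after {len(text)} characters"
else:
    status = "complete"
```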

google.api_core.exceptions.InvalidArgument

You passed an invalid parameter to generate_content(). Verify that stream=True is a boolean and that the prompt is a string. Check that your google-generativeai version is 0.8.x or later.

Experienced dev note

Streaming reduces perceived latency but introduces state management complexity. Always accumulate chunks into a variable (like full_text) in case you need the complete response for logging, validation, or retry logic. Also check the final chunk's finish reason before assuming the response is complete: STOP means the model finished naturally, while MAX_TOKENS means it hit the output limit and may have been cut off mid-sentence.

Check your understanding

If your streaming loop finishes but you realize you need the full response's usage_metadata (token counts), what should you have done differently?

Answer hint

usage_metadata only appears on the final chunk. Either save it during iteration or read response.usage_metadata after the loop completes. Because the stream should be treated as single-pass, the safer plan is to capture the metadata inside the loop.

VERSION This guide assumes google-generativeai 0.8.x or later, where generate_content(..., stream=True) returns a response you iterate over chunk by chunk. Older releases may behave differently. Verify your installed version by running: python -c 'import google.generativeai as genai; print(genai.__version__)'
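The same version check can be done inside a script without importing the SDK. This is a minimal sketch using the standard library's importlib.metadata; parse_version and meets_minimum are hypothetical helpers with deliberately simple parsing, and the distribution name google-generativeai is assumed to match how the package was installed.

```python
from importlib import metadata


def parse_version(version):
    """Turn '0.8.3' into (0, 8, 3), dropping any non-numeric suffix such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum=(0, 8)):
    """Compare only as many components as the minimum specifies."""
    return parse_version(version)[: len(minimum)] >= minimum


def installed_version(package="google-generativeai"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


current = installed_version()
if current is None:
    message = "google-generativeai is not installed"
elif meets_minimum(current):
    message = f"google-generativeai {current} is recent enough"
else:
    message = f"google-generativeai {current} is older than 0.8; consider upgrading"
```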
