Code intermediate · 3 min read

How to stream Gemini API responses in Python

Direct answer
Create a google.generativeai.GenerativeModel, call generate_content() with stream=True, and iterate over the returned chunks to receive text as it arrives.

Setup

Install
bash
pip install google-generativeai
Env vars
GOOGLE_API_KEY
Imports
python
import os
import google.generativeai as genai

Examples

In: User prompt: 'Explain quantum computing in simple terms.'
Out: Streaming tokens outputting a step-by-step explanation as they arrive.
In: User prompt: 'Write a Python function to reverse a string.'
Out: Streaming tokens outputting the Python code snippet progressively.
In: User prompt: 'Summarize the latest AI trends.'
Out: Streaming tokens outputting a concise summary in real time.

Integration steps

  1. Install the google-generativeai SDK and set the GOOGLE_API_KEY environment variable.
  2. Import the SDK and configure it with the API key from os.environ.
  3. Call generate_content() with the stream=True parameter.
  4. Iterate over the streamed response chunks as they arrive.
  5. Extract the incremental text content from each chunk.
  6. Combine or display tokens in real time for a streaming user experience.
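The six steps above map onto the SDK like this. A minimal sketch: the helper render_chunks and the model name gemini-1.5-flash are illustrative, not prescribed by the API; substitute any Gemini model available to your key.

```python
import os

def render_chunks(chunks):
    """Print each streamed chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        parts.append(chunk.text)
    return "".join(parts)

def main():
    # Third-party import kept local so the helper above stays dependency-free
    import google.generativeai as genai

    # Steps 1-2: configure the SDK with the key from the environment
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Step 3: request a completion with stream=True
    response = model.generate_content("Explain streaming.", stream=True)

    # Steps 4-6: iterate, extract, and display tokens in real time
    render_chunks(response)

if __name__ == "__main__":
    main()
```

Because render_chunks only depends on each chunk exposing a text attribute, it can be exercised with stub objects before wiring up a real API key.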

Full code

python
import os
import google.generativeai as genai

def stream_gemini_response():
    # Configure the SDK with the API key from the environment
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "Explain the benefits of renewable energy."

    # Create the request with streaming enabled
    response = model.generate_content(
        prompt,
        generation_config={
            "temperature": 0.7,
            "candidate_count": 1,
            "max_output_tokens": 256,
        },
        stream=True,
    )

    # Stream the response
    print("Streaming response:")
    for chunk in response:
        # Each chunk carries an incremental slice of the answer
        print(chunk.text, end="", flush=True)
    print()  # Newline after streaming completes

if __name__ == "__main__":
    stream_gemini_response()
output
Streaming response:
Renewable energy offers sustainable power sources that reduce greenhouse gas emissions, decrease dependence on fossil fuels, and promote environmental health.

API trace

Request
json
{"contents": [{"parts": [{"text": "Explain the benefits of renewable energy."}]}], "generationConfig": {"maxOutputTokens": 256, "temperature": 0.7}}
Response
json
{"candidates": [{"content": {"parts": [{"text": "partial token text"}], "role": "model"}}]}
Extract: Iterate over streamed chunks and concatenate each candidate's content text
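In SDK terms, the extraction step amounts to folding the chunks into one string. A minimal sketch, assuming each streamed chunk exposes a text attribute (as the SDK's chunk objects do):

```python
def collect_stream(chunks):
    """Concatenate the incremental text carried by each streamed chunk."""
    return "".join(chunk.text for chunk in chunks if chunk.text)
```

This is useful when you want to display tokens live and also keep the assembled response, e.g. for logging or caching.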

Variants

Async streaming with Gemini API in Python

Use async streaming to handle multiple concurrent Gemini API calls efficiently in an async Python environment.

python
import asyncio
import os
import google.generativeai as genai

async def async_stream_gemini_response():
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "Describe the process of photosynthesis."

    # generate_content_async returns an async iterable when stream=True
    response = await model.generate_content_async(prompt, stream=True)

    print("Async streaming response:")
    async for chunk in response:
        print(chunk.text, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(async_stream_gemini_response())
Non-streaming Gemini API call in Python

Use non-streaming calls when you want the full response at once and do not need incremental token updates.

python
import os
import google.generativeai as genai

def non_stream_gemini_response():
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "List the top 5 programming languages in 2026."

    # Without stream=True, the call blocks until the full answer is ready
    response = model.generate_content(
        prompt,
        generation_config={"max_output_tokens": 128},
    )
    print("Response:", response.text)

if __name__ == "__main__":
    non_stream_gemini_response()

Performance

Latency: ~500-1000 ms initial token delay, then tokens stream in real time
Cost: ~$0.0015 per 1,000 tokens (model-dependent; check current Gemini API pricing)
Rate limits: Tier 1: 600 RPM / 60K TPM
  • Use concise prompts to reduce token usage.
  • Limit max_output_tokens to control response length.
  • Reuse context efficiently to avoid redundant tokens.
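The second tip can be applied per request through a generation config. A sketch using the dict form accepted by the google-generativeai SDK; the values here are illustrative, not recommendations:

```python
# Illustrative generation config: cap output length and keep sampling mild
generation_config = {
    "max_output_tokens": 128,  # hard cap on response length (cost control)
    "temperature": 0.7,        # moderate randomness
}

# Passed per request, e.g.:
# model.generate_content(prompt, generation_config=generation_config, stream=True)
```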
Approach           | Latency            | Cost/call          | Best for
Streaming (sync)   | ~500ms + streaming | ~$0.0015/1K tokens | Real-time UI updates
Streaming (async)  | ~500ms + streaming | ~$0.0015/1K tokens | Concurrent calls in async apps
Non-streaming      | ~800ms total       | ~$0.0015/1K tokens | Simple batch processing

Quick tip

Always set stream=True in your request to receive Gemini API responses token-by-token for real-time display.

Common mistake

Forgetting to set stream=True results in the entire response being returned only after completion, losing the streaming benefits.

Verified 2026-04 · gemini-1.5-flash