Code intermediate · 3 min read

How to stream Gemini API responses in Python

Direct answer
Create a google.generativeai.GenerativeModel, call generate_content() with stream=True, and iterate over the returned chunks to receive text as it arrives.

Setup

Install
bash
pip install google-generativeai
Env vars
GOOGLE_API_KEY
Imports
python
import os
import google.generativeai as genai

Examples

In: User prompt: 'Explain quantum computing in simple terms.'
Out: Streaming tokens outputting a step-by-step explanation as they arrive.
In: User prompt: 'Write a Python function to reverse a string.'
Out: Streaming tokens outputting the Python code snippet progressively.
In: User prompt: 'Summarize the latest AI trends.'
Out: Streaming tokens outputting a concise summary in real time.

Integration steps

  1. Install the google-generativeai SDK and set the GOOGLE_API_KEY environment variable.
  2. Import the SDK and configure it with the API key from os.environ.
  3. Call generate_content() with the stream=True parameter.
  4. Iterate over the streamed response chunks as they arrive.
  5. Extract the incremental text content from each chunk.
  6. Combine or display tokens in real time for a streaming user experience.
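The six steps above map onto the SDK like this. A minimal sketch: the helper render_chunks and the model name gemini-1.5-flash are illustrative, not prescribed by the API; substitute any Gemini model available to your key.

```python
import os

def render_chunks(chunks):
    """Print each streamed chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        parts.append(chunk.text)
    return "".join(parts)

def main():
    # Third-party import kept local so the helper above stays dependency-free
    import google.generativeai as genai

    # Steps 1-2: configure the SDK with the key from the environment
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Step 3: request a completion with stream=True
    response = model.generate_content("Explain streaming.", stream=True)

    # Steps 4-6: iterate, extract, and display tokens in real time
    render_chunks(response)

if __name__ == "__main__":
    main()
```

Because render_chunks only depends on each chunk exposing a text attribute, it can be exercised with stub objects before wiring up a real API key.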

Full code

python
import os
import google.generativeai as genai

def stream_gemini_response():
    # Configure the SDK with the API key from the environment
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "Explain the benefits of renewable energy."

    # Create the request with streaming enabled
    response = model.generate_content(
        prompt,
        generation_config={
            "temperature": 0.7,
            "candidate_count": 1,
            "max_output_tokens": 256,
        },
        stream=True,
    )

    # Stream the response
    print("Streaming response:")
    for chunk in response:
        # Each chunk carries an incremental slice of the answer
        print(chunk.text, end="", flush=True)
    print()  # Newline after streaming completes

if __name__ == "__main__":
    stream_gemini_response()
output
Streaming response:
Renewable energy offers sustainable power sources that reduce greenhouse gas emissions, decrease dependence on fossil fuels, and promote environmental health.

API trace

Request
json
{"contents": [{"parts": [{"text": "Explain the benefits of renewable energy."}]}], "generationConfig": {"maxOutputTokens": 256, "temperature": 0.7}}
Response
json
{"candidates": [{"content": {"parts": [{"text": "partial token text"}], "role": "model"}}]}
Extract: Iterate over streamed chunks and concatenate each candidate's content text
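In SDK terms, the extraction step amounts to folding the chunks into one string. A minimal sketch, assuming each streamed chunk exposes a text attribute (as the SDK's chunk objects do):

```python
def collect_stream(chunks):
    """Concatenate the incremental text carried by each streamed chunk."""
    return "".join(chunk.text for chunk in chunks if chunk.text)
```

This is useful when you want to display tokens live and also keep the assembled response, e.g. for logging or caching.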

Variants

Async streaming with Gemini API in Python

Use async streaming to handle multiple concurrent Gemini API calls efficiently in an async Python environment.

python
import asyncio
import os
import google.generativeai as genai

async def async_stream_gemini_response():
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "Describe the process of photosynthesis."

    # generate_content_async returns an async iterable when stream=True
    response = await model.generate_content_async(prompt, stream=True)

    print("Async streaming response:")
    async for chunk in response:
        print(chunk.text, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(async_stream_gemini_response())
Non-streaming Gemini API call in Python

Use non-streaming calls when you want the full response at once and do not need incremental token updates.

python
import os
import google.generativeai as genai

def non_stream_gemini_response():
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")

    prompt = "List the top 5 programming languages in 2026."

    # Without stream=True, the call blocks until the full answer is ready
    response = model.generate_content(
        prompt,
        generation_config={"max_output_tokens": 128},
    )
    print("Response:", response.text)

if __name__ == "__main__":
    non_stream_gemini_response()

Performance

Latency: ~500-1000 ms initial token delay, then tokens stream in real time
Cost: ~$0.0015 per 1,000 tokens (model-dependent; check current Gemini API pricing)
Rate limits: Tier 1: 600 RPM / 60K TPM
  • Use concise prompts to reduce token usage.
  • Limit max_output_tokens to control response length.
  • Reuse context efficiently to avoid redundant tokens.
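The second tip can be applied per request through a generation config. A sketch using the dict form accepted by the google-generativeai SDK; the values here are illustrative, not recommendations:

```python
# Illustrative generation config: cap output length and keep sampling mild
generation_config = {
    "max_output_tokens": 128,  # hard cap on response length (cost control)
    "temperature": 0.7,        # moderate randomness
}

# Passed per request, e.g.:
# model.generate_content(prompt, generation_config=generation_config, stream=True)
```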
Approach           | Latency            | Cost/call          | Best for
Streaming (sync)   | ~500ms + streaming | ~$0.0015/1K tokens | Real-time UI updates
Streaming (async)  | ~500ms + streaming | ~$0.0015/1K tokens | Concurrent calls in async apps
Non-streaming      | ~800ms total       | ~$0.0015/1K tokens | Simple batch processing

Quick tip

Always set stream=True in your request to receive Gemini API responses token-by-token for real-time display.

Common mistake

Forgetting to set stream=True results in the entire response being returned only after completion, losing the streaming benefits.

Verified 2026-04 · gemini-1.5-flash