How to stream Gemini API response in python
Direct answer
Use the Google Gen AI Python SDK (google-genai) and call client.models.generate_content_stream() to receive streamed chunks as they arrive.
Setup
Install
pip install google-genai
Env vars
GOOGLE_API_KEY
Imports
from google import genai
import os
Examples
in: User prompt: 'Explain quantum computing in simple terms.'
out: Streaming tokens outputting a step-by-step explanation as they arrive.
in: User prompt: 'Write a Python function to reverse a string.'
out: Streaming tokens outputting the Python code snippet progressively.
in: User prompt: 'Summarize the latest AI trends.'
out: Streaming tokens outputting a concise summary in real time.
Integration steps
- Install the google-genai SDK and set the GOOGLE_API_KEY environment variable.
- Import the client and initialize it with the API key from os.environ.
- Call client.models.generate_content_stream() to request a streamed response.
- Iterate over the streamed response chunks as they arrive.
- Extract and process the incremental text content from each chunk.
- Combine or display tokens in real time for a streaming user experience.
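The consume-and-display pattern in these steps can be sketched without touching the network. In the snippet below, fake_stream is a stand-in generator (not part of the SDK) that plays the role of the streamed response:

```python
def fake_stream(text, chunk_size=8):
    """Stand-in for an SDK stream: yields the reply in small pieces."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]

def consume_stream(stream):
    """Display chunks as they arrive and return the assembled reply."""
    parts = []
    for chunk in stream:
        print(chunk, end="", flush=True)  # incremental, real-time display
        parts.append(chunk)
    print()  # newline once the stream is exhausted
    return "".join(parts)

reply = consume_stream(fake_stream("Streaming keeps the UI responsive."))
```

Swapping fake_stream for a real SDK stream changes nothing in the consumer loop, which is what makes the pattern easy to unit-test.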
Full code
from google import genai
from google.genai import types
import os

def stream_gemini_response():
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    prompt = "Explain the benefits of renewable energy."
    # Request a streamed response; chunks arrive as they are generated
    stream = client.models.generate_content_stream(
        model="gemini-2.0-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            temperature=0.7,
            candidate_count=1,
            max_output_tokens=256,
        ),
    )
    print("Streaming response:")
    for chunk in stream:
        # Each chunk carries a partial piece of the response text
        if chunk.text:
            print(chunk.text, end="", flush=True)
    print()  # Newline after streaming completes

if __name__ == "__main__":
    stream_gemini_response()
output
Streaming response: Renewable energy offers sustainable power sources that reduce greenhouse gas emissions, decrease dependence on fossil fuels, and promote environmental health.
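The chunk-handling logic in the loop above can be exercised offline. In this sketch, Chunk is a stand-in for SDK chunk objects (real chunks expose .text the same way, and it can be None for some chunks, which is why the guard matters):

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Chunk:
    """Stand-in for an SDK stream chunk with a .text attribute."""
    text: Optional[str]

def assemble(chunks: Iterable[Chunk]) -> str:
    """Concatenate chunk text, skipping chunks that carry no text."""
    parts = []
    for chunk in chunks:
        if chunk.text:  # guard: .text may be None on some chunks
            parts.append(chunk.text)
    return "".join(parts)

pieces = [Chunk("Renewable energy "), Chunk(None), Chunk("reduces emissions.")]
print(assemble(pieces))  # → Renewable energy reduces emissions.
```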
API trace
Request
POST /v1beta/models/gemini-2.0-flash:streamGenerateContent
{"contents": [{"parts": [{"text": "Explain the benefits of renewable energy."}]}], "generationConfig": {"temperature": 0.7, "maxOutputTokens": 256}}
Response
{"candidates": [{"content": {"parts": [{"text": "partial token text"}], "role": "model"}}]}
Extract
Iterate over the streamed chunks and concatenate their text fields.
Variants
Async streaming with Gemini API in Python ›
Use async streaming to handle multiple concurrent Gemini API calls efficiently in an async Python environment.
import asyncio
import os
from google import genai

async def async_stream_gemini_response():
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    prompt = "Describe the process of photosynthesis."
    # client.aio exposes the async API; awaiting returns an async iterator
    stream = await client.aio.models.generate_content_stream(
        model="gemini-2.0-flash",
        contents=prompt,
    )
    print("Async streaming response:")
    async for chunk in stream:
        if chunk.text:
            print(chunk.text, end="", flush=True)
    print()

if __name__ == "__main__":
    asyncio.run(async_stream_gemini_response())
Non-streaming Gemini API call in Python ›
Use non-streaming calls when you want the full response at once and do not need incremental token updates.
from google import genai
from google.genai import types
import os

def non_stream_gemini_response():
    client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
    # Without streaming, the call blocks until the full response is ready
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="List the top 5 programming languages in 2026.",
        config=types.GenerateContentConfig(max_output_tokens=128),
    )
    print("Response:", response.text)

if __name__ == "__main__":
    non_stream_gemini_response()
Performance
Latency: a short initial delay before the first token, then tokens stream in real time
Cost: billed per token; rates vary by model, so check the current Gemini API pricing page
Rate limits: RPM and TPM quotas vary by model and usage tier; check the current Gemini API rate-limit documentation
- Use concise prompts to reduce token usage.
- Limit max_output_tokens to control response length.
- Reuse context efficiently to avoid redundant tokens.
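As one illustration of the first two tips, a rough pre-flight budget check can trim oversized prompts before sending them. The 4-characters-per-token ratio below is a common rule of thumb for English text, not an exact tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def trim_prompt(text: str, max_tokens: int) -> str:
    """Trim a prompt to an approximate token budget before sending it."""
    budget_chars = max_tokens * 4
    return text if len(text) <= budget_chars else text[:budget_chars]

prompt = "Summarize the latest AI trends. " * 40
trimmed = trim_prompt(prompt, max_tokens=50)
print(rough_token_estimate(trimmed))  # stays at or under the 50-token budget
```

For accurate counts, the Gemini API also exposes a token-counting endpoint, which is preferable when precision matters.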
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Streaming (sync) | short delay to first token, then incremental | per token, model-dependent | Real-time UI updates |
| Streaming (async) | same per call as sync | per token, model-dependent | Concurrent calls in async apps |
| Non-streaming | full generation time before any output | per token, model-dependent | Simple batch processing |
Quick tip
Use client.models.generate_content_stream() rather than generate_content() to receive Gemini API responses chunk-by-chunk for real-time display.
Common mistake
Calling generate_content() when you want streaming returns the entire response only after completion, losing the streaming benefits.