How-to · Beginner · 3 min read

How to stream Vertex AI responses

Quick answer
Use the vertexai Python SDK to stream responses from Vertex AI by calling model.generate_content() with stream=True. Iterate over the returned chunks to process text as it arrives; in async code, call model.generate_content_async() with stream=True and iterate with async for.

PREREQUISITES

  • Python 3.8+
  • Google Cloud project with Vertex AI enabled
  • Google Cloud SDK installed and authenticated
  • pip install google-cloud-aiplatform (the package that provides the vertexai module)
  • Set environment variable GOOGLE_CLOUD_PROJECT
  • Set environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to your service account JSON key

Setup

Install the google-cloud-aiplatform package (which provides the vertexai module) and authenticate your Google Cloud environment. Ensure you have a service account JSON key with Vertex AI permissions and point the GOOGLE_APPLICATION_CREDENTIALS environment variable at it. Also set your Google Cloud project ID in GOOGLE_CLOUD_PROJECT.

bash
pip install google-cloud-aiplatform
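
Before running any code, export the two environment variables in your shell. The values below are placeholders; substitute your own project ID and key-file path:

```shell
# Replace with your actual project ID and service account key path
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```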

Step by step

This example demonstrates streaming text generation from the gemini-2.0-flash model using the vertexai SDK. Passing stream=True enables chunk-by-chunk streaming. Because the example runs inside an asyncio event loop, it uses the async variant, generate_content_async(), and iterates over the response chunks with async for, printing them in real time.

python
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel

async def stream_vertex_ai():
    # Initialize Vertex AI with your project and location
    vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")

    # Load the Gemini model
    model = GenerativeModel("gemini-2.0-flash")

    # Start streaming generation; the async variant returns an async iterator
    response = await model.generate_content_async("Explain quantum computing", stream=True)

    # Async iteration over streamed chunks
    async for chunk in response:
        print(chunk.text, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(stream_vertex_ai())
output
Quantum computing is a field of study that applies the principles of quantum mechanics to computation. It leverages quantum bits, or qubits, which can exist in multiple states simultaneously, enabling powerful parallel processing capabilities...

Common variations

  • Use synchronous streaming if your environment does not support async: generate_content(stream=True) returns a regular iterator, so a plain for loop works.
  • Change the model to other Gemini versions like gemini-2.5-pro by modifying the model name.
  • Adjust location parameter in vertexai.init() to your region.
python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.5-pro")

response = model.generate_content("Summarize AI trends in 2026", stream=True)
for chunk in response:
    print(chunk.text, end="", flush=True)
output
AI trends in 2026 include widespread adoption of multimodal models, increased efficiency in training, and integration of AI in everyday applications across industries...
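
Streaming prints text as it arrives, but you often also want the complete response afterwards, for logging or post-processing. One approach is to accumulate the chunks while iterating. The collect_stream helper below is an illustrative sketch, not part of the SDK; it assumes each chunk exposes a .text attribute, as the Gemini response chunks above do:

```python
def collect_stream(chunks):
    """Accumulate streamed chunk text into one string.

    Works with any iterable of objects exposing a .text attribute,
    such as the chunks yielded by generate_content(stream=True).
    """
    parts = []
    for chunk in chunks:
        parts.append(chunk.text)  # gather each streamed fragment
    return "".join(parts)
```

With the sync example above, full_text = collect_stream(response) would give you the entire generation in one string (note each chunk can only be consumed once).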

Troubleshooting

  • If you see authentication errors, verify your GOOGLE_APPLICATION_CREDENTIALS path and service account permissions.
  • If streaming does not work, ensure your Python environment supports asyncio and you are using the latest vertexai SDK.
  • Check your Google Cloud project and region settings if the model fails to load.
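
A quick preflight check can surface missing configuration before the SDK raises a less obvious error. The missing_config helper below is a hypothetical sketch, not part of the SDK; it only inspects the environment mapping you pass in:

```python
import os

def missing_config(env, required=("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS")):
    """Return the names of required settings that are unset or empty in env."""
    return [name for name in required if not env.get(name)]

if __name__ == "__main__":
    problems = missing_config(os.environ)
    if problems:
        print("Missing configuration:", ", ".join(problems))
    else:
        print("Environment variables look set.")
```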

Key Takeaways

  • Use model.generate_content(stream=True) — or generate_content_async() in async code — to enable streaming in Vertex AI.
  • Iterate over the response chunks (with async for when using the async API) to handle text in real time.
  • Set up Google Cloud authentication and project environment variables correctly.
  • You can switch models or regions by changing parameters in vertexai.init() and model name.
  • Troubleshoot common issues by verifying credentials and SDK versions.
Verified 2026-04 · gemini-2.0-flash, gemini-2.5-pro