How to set up Vertex AI in Python
Direct answer
Use the vertexai Python SDK to initialize your Google Cloud project and location, then create and call a GenerativeModel instance for text generation.
Setup
Install
pip install google-cloud-aiplatform
The google-cloud-aiplatform package provides the vertexai module used below.
Env vars
GOOGLE_CLOUD_PROJECT
GOOGLE_APPLICATION_CREDENTIALS
Imports
import os
import vertexai
from vertexai.generative_models import GenerativeModel
Examples
In: Explain quantum computing
Out: Quantum computing is a type of computation that uses quantum bits to perform operations exponentially faster than classical computers for certain problems.
In: Summarize the benefits of AI in healthcare
Out: AI in healthcare improves diagnostics, personalizes treatment, automates administrative tasks, and accelerates drug discovery.
In: (empty prompt)
Out: Error: Input prompt cannot be empty.
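The empty-prompt error above is cheap to catch on the client side before paying for a failed API call. A minimal sketch, assuming the model object created in the full code below; the generate helper name is illustrative, not part of the SDK:
def generate(model, prompt: str) -> str:
    # Reject empty or whitespace-only prompts before hitting the API
    if not prompt or not prompt.strip():
        raise ValueError("Input prompt cannot be empty.")
    response = model.generate_content(prompt)
    return response.text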
Integration steps
- Install the vertexai SDK and set environment variables for authentication.
- Initialize the vertexai client with your Google Cloud project and location.
- Create a GenerativeModel instance with the desired model name (e.g., 'gemini-2.0-flash').
- Call the model's generate_content() method with your input prompt.
- Extract the generated text from the response object.
Full code
import os
import vertexai
from vertexai.generative_models import GenerativeModel
# Set your Google Cloud project and location
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
# Initialize the Vertex AI SDK
vertexai.init(project=project_id, location=location)
# Load the Gemini 2.0 Flash model
model = GenerativeModel("gemini-2.0-flash")
# Define the prompt
prompt = "Explain quantum computing"
# Generate content
response = model.generate_content(prompt)
# Print the generated text
print(response.text)
API trace
Request
{"model": "gemini-2.0-flash", "prompt": "Explain quantum computing"}
Response
{"text": "Quantum computing is a type of computation that uses quantum bits...", "metadata": {...}}
Extract
response.text
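Beyond response.text, the response object exposes metadata that is useful for logging and cost tracking. A short sketch, assuming the response object from the full code above; attribute names such as usage_metadata follow the current vertexai SDK and may differ between versions:
# Inspect the response beyond the plain generated text
print(response.text)                                   # generated text
print(response.usage_metadata.prompt_token_count)      # tokens in the prompt
print(response.usage_metadata.candidates_token_count)  # tokens generated
print(response.candidates[0].finish_reason)            # why generation stopped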
Variants
Streaming response
Use streaming to display partial results immediately for long or interactive responses.
import os
import vertexai
from vertexai.generative_models import GenerativeModel
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.0-flash")
prompt = "Explain quantum computing"
# Stream the generated content
for chunk in model.generate_content(prompt, stream=True):
    print(chunk.text, end="", flush=True)
Async version
Use async calls to integrate Vertex AI generation into asynchronous Python applications.
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel
async def main():
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    location = "us-central1"
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-2.0-flash")
    prompt = "Explain quantum computing"
    response = await model.generate_content_async(prompt)
    print(response.text)
asyncio.run(main())
Alternative model: Gemini 2.5 Pro
Use Gemini 2.5 Pro for higher quality or more complex generation tasks.
import os
import vertexai
from vertexai.generative_models import GenerativeModel
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.5-pro")
prompt = "Explain quantum computing"
response = model.generate_content(prompt)
print(response.text)
Performance
Latency: ~800ms for gemini-2.0-flash non-streaming calls
Cost: ~$0.003 per 500 tokens generated with gemini-2.0-flash
Rate limits: default quota of 600 requests per minute (RPM) per project
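Latency varies with region, prompt size, and model, so it is worth measuring in your own environment. A minimal timing sketch, reusing the model and prompt from the full code above:
import time

start = time.perf_counter()
response = model.generate_content("Explain quantum computing")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Generated {len(response.text)} characters in {elapsed_ms:.0f} ms")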
- Use concise prompts to reduce token usage.
- Leverage streaming to start processing output early.
- Cache frequent queries to avoid repeated calls (see the sketch after this list).
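For the caching tip, a simple in-memory cache keyed on the prompt string avoids repeated calls for identical queries. This is an illustrative sketch, not an SDK feature; it assumes the model object from the full code and a single-process application (production setups would more likely use a size-bounded or external cache):
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_generate(prompt: str) -> str:
    # Identical prompts are answered from memory after the first API call
    return model.generate_content(prompt).text

print(cached_generate("Explain quantum computing"))  # calls the API
print(cached_generate("Explain quantum computing"))  # served from the cache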
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard generate_content() | ~800ms | ~$0.003 | Simple synchronous calls |
| Streaming generate_content(stream=True) | Starts immediately | ~$0.003 | Long or interactive outputs |
| Async generate_content_async() | ~800ms | ~$0.003 | Async Python apps |
Quick tip
Always initialize vertexai with your project and location before creating model instances.
Common mistake
Forgetting to set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account JSON key file.
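A quick startup check surfaces this problem before the first API call fails with an authentication error. A minimal sketch, assuming service-account authentication via the two environment variables listed in Setup (if you authenticate with gcloud application-default credentials instead, GOOGLE_APPLICATION_CREDENTIALS may not be needed):
import os

# Fail fast if authentication is not configured
for var in ("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS"):
    if not os.environ.get(var):
        raise RuntimeError(f"Set {var} before calling vertexai.init().")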