Code beginner · 3 min read

How to call Vertex AI Gemini API in Python

Direct answer
Use the vertexai Python SDK: initialize it with your project and location, load a Gemini model with GenerativeModel from vertexai.generative_models, and call model.generate_content() with your prompt as input.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
  • GOOGLE_CLOUD_PROJECT: your Google Cloud project ID
  • GOOGLE_APPLICATION_CREDENTIALS: path to your service account JSON key
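For example, in a bash shell (the project ID and key path below are placeholders):

```shell
export GOOGLE_CLOUD_PROJECT="my-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/vertex-sa.json"
```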
Imports
python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

Examples

In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.
In: Write a Python function to reverse a string.
Out: def reverse_string(s): return s[::-1]
In: Summarize the latest AI trends in 2026.
Out: AI in 2026 focuses on multimodal models, efficient fine-tuning, and real-time reasoning capabilities.

Integration steps

  1. Set up Google Cloud authentication with service account JSON and set environment variables.
  2. Import the vertexai SDK and initialize it with your project and location.
  3. Load the Gemini model using GenerativeModel() from vertexai.generative_models.
  4. Call model.generate_content() with your prompt string.
  5. Extract the generated text from the response object.
  6. Print or use the generated content as needed.

Full code

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

# Set your Google Cloud project and location
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"

# Initialize the Vertex AI SDK
vertexai.init(project=project_id, location=location)

# Load the Gemini model
model = GenerativeModel("gemini-2.0-flash")

# Define the prompt
prompt = "Explain quantum computing in simple terms."

# Generate content
response = model.generate_content(prompt)

# Print the generated text
print("Generated response:")
print(response.text)
output
Generated response:
Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.

API trace

Request
json
{"contents": [{"role": "user", "parts": [{"text": "Explain quantum computing in simple terms."}]}]}
Response
json
{"candidates": [{"content": {"parts": [{"text": "Quantum computing uses quantum bits to perform complex calculations much faster than classical computers."}]}}], "usageMetadata": {...}}
Extract: response.text
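If response.text raises because a candidate was blocked or returned no parts, you can fall back to walking the candidates list yourself. A minimal sketch, assuming a response object shaped like the SDK's; the extract_text helper is hypothetical:

```python
def extract_text(response):
    """Return generated text, falling back to the first candidate's parts."""
    try:
        return response.text
    except (ValueError, AttributeError):
        # Fall back to the candidates structure directly
        for candidate in getattr(response, "candidates", []):
            parts = getattr(candidate.content, "parts", [])
            if parts:
                return "".join(p.text for p in parts)
        return ""
```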

Variants

Streaming response

Use streaming to display partial results immediately for long or interactive responses.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.0-flash")
prompt = "Explain quantum computing in simple terms."

# Stream the generated content
for chunk in model.generate_content(prompt, stream=True):
    print(chunk.text, end="", flush=True)
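If you also need the complete text after streaming, accumulate the chunks as they arrive. A sketch of the pattern; the chunk objects here stand in for the SDK's streamed responses:

```python
def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        pieces.append(chunk.text)
    return "".join(pieces)
```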
Async call

Use async calls to integrate with asynchronous Python applications or frameworks.

python
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel

async def main():
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    location = "us-central1"
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-2.0-flash")
    prompt = "Explain quantum computing in simple terms."
    response = await model.generate_content_async(prompt)
    print("Generated response:", response.text)

asyncio.run(main())
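The main payoff of the async variant is fanning out several prompts concurrently with asyncio.gather. A sketch of the pattern; generate below is a stand-in for an awaited model.generate_content_async call, so no network access is needed:

```python
import asyncio

async def generate(prompt):
    # Stand-in for: response = await model.generate_content_async(prompt)
    await asyncio.sleep(0)  # simulates the network round trip
    return f"answer to: {prompt}"

async def fan_out(prompts):
    # Fire all requests concurrently and wait for every response
    return await asyncio.gather(*(generate(p) for p in prompts))

results = asyncio.run(fan_out(["What is AI?", "What is ML?"]))
```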
Alternative model: gemini-2.5-pro

Use gemini-2.5-pro for higher quality or more complex tasks with slightly higher latency.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.5-pro")
prompt = "Explain quantum computing in simple terms."
response = model.generate_content(prompt)
print("Generated response:")
print(response.text)

Performance

  • Latency: ~800ms for gemini-2.0-flash non-streaming calls
  • Cost: ~$0.003 per 500 tokens for gemini-2.0-flash
  • Rate limits: default tier 300 RPM / 60K TPM
  • Keep prompts concise to reduce token usage.
  • Use streaming to start processing output before full completion.
  • Cache frequent prompts and responses to avoid repeated calls.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~800ms | ~$0.003/500 tokens | General purpose, simple integration |
| Streaming call | Starts immediately, total ~800ms | ~$0.003/500 tokens | Long responses, better UX |
| Async call | ~800ms | ~$0.003/500 tokens | Concurrent or event-driven apps |

Quick tip

Always initialize vertexai with your project and location before loading models to avoid authentication errors.

Common mistake

Forgetting to set the GOOGLE_APPLICATION_CREDENTIALS environment variable with the path to your service account JSON causes authentication failures.
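A quick guard that fails fast with a clear message before calling vertexai.init; the variable names are the ones from Setup, and check_credentials is a hypothetical helper:

```python
import os

def check_credentials():
    """Raise a clear error if required environment variables are missing."""
    required = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS")
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```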

Verified 2026-04 · gemini-2.0-flash, gemini-2.5-pro