Code beginner · 3 min read

How to call Vertex AI Gemini API in Python

Direct answer
Use the vertexai Python SDK: initialize it with your project and location, load a Gemini model with GenerativeModel from vertexai.generative_models, and call model.generate_content() with your prompt as input.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
  • GOOGLE_CLOUD_PROJECT: your Google Cloud project ID
  • GOOGLE_APPLICATION_CREDENTIALS: path to your service account JSON key
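For example, in a bash shell (the project ID and key path below are placeholders):

```shell
export GOOGLE_CLOUD_PROJECT="my-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/vertex-sa.json"
```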
Imports
python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

Examples

In: Explain quantum computing in simple terms.
Out: Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.
In: Write a Python function to reverse a string.
Out: def reverse_string(s): return s[::-1]
In: Summarize the latest AI trends in 2026.
Out: AI in 2026 focuses on multimodal models, efficient fine-tuning, and real-time reasoning capabilities.

Integration steps

  1. Set up Google Cloud authentication with service account JSON and set environment variables.
  2. Import the vertexai SDK and initialize it with your project and location.
  3. Load the Gemini model using GenerativeModel() from vertexai.generative_models.
  4. Call model.generate_content() with your prompt string.
  5. Extract the generated text from the response object.
  6. Print or use the generated content as needed.

Full code

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

# Set your Google Cloud project and location
project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"

# Initialize the Vertex AI SDK
vertexai.init(project=project_id, location=location)

# Load the Gemini model
model = GenerativeModel("gemini-2.0-flash")

# Define the prompt
prompt = "Explain quantum computing in simple terms."

# Generate content
response = model.generate_content(prompt)

# Print the generated text
print("Generated response:")
print(response.text)
output
Generated response:
Quantum computing uses quantum bits to perform complex calculations much faster than classical computers.

API trace

Request
json
{"contents": [{"role": "user", "parts": [{"text": "Explain quantum computing in simple terms."}]}]}
Response
json
{"candidates": [{"content": {"parts": [{"text": "Quantum computing uses quantum bits to perform complex calculations much faster than classical computers."}]}}], "usageMetadata": {...}}
Extract: response.text
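If response.text raises because a candidate was blocked or returned no parts, you can fall back to walking the candidates list yourself. A minimal sketch, assuming a response object shaped like the SDK's; the extract_text helper is hypothetical:

```python
def extract_text(response):
    """Return generated text, falling back to the first candidate's parts."""
    try:
        return response.text
    except (ValueError, AttributeError):
        # Fall back to the candidates structure directly
        for candidate in getattr(response, "candidates", []):
            parts = getattr(candidate.content, "parts", [])
            if parts:
                return "".join(p.text for p in parts)
        return ""
```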

Variants

Streaming response

Use streaming to display partial results immediately for long or interactive responses.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.0-flash")
prompt = "Explain quantum computing in simple terms."

# Stream the generated content
for chunk in model.generate_content(prompt, stream=True):
    print(chunk.text, end="", flush=True)
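If you also need the complete text after streaming, accumulate the chunks as they arrive. A sketch of the pattern; the chunk objects here stand in for the SDK's streamed responses:

```python
def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        pieces.append(chunk.text)
    return "".join(pieces)
```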
Async call

Use async calls to integrate with asynchronous Python applications or frameworks.

python
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel

async def main():
    project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
    location = "us-central1"
    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-2.0-flash")
    prompt = "Explain quantum computing in simple terms."
    response = await model.generate_content_async(prompt)
    print("Generated response:", response.text)

asyncio.run(main())
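The main payoff of the async variant is fanning out several prompts concurrently with asyncio.gather. A sketch of the pattern; generate below is a stand-in for an awaited model.generate_content_async call, so no network access is needed:

```python
import asyncio

async def generate(prompt):
    # Stand-in for: response = await model.generate_content_async(prompt)
    await asyncio.sleep(0)  # simulates the network round trip
    return f"answer to: {prompt}"

async def fan_out(prompts):
    # Fire all requests concurrently and wait for every response
    return await asyncio.gather(*(generate(p) for p in prompts))

results = asyncio.run(fan_out(["What is AI?", "What is ML?"]))
```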
Alternative model: gemini-2.5-pro

Use gemini-2.5-pro for higher quality or more complex tasks with slightly higher latency.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["GOOGLE_CLOUD_PROJECT"]
location = "us-central1"
vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-2.5-pro")
prompt = "Explain quantum computing in simple terms."
response = model.generate_content(prompt)
print("Generated response:")
print(response.text)

Performance

  • Latency: ~800ms for gemini-2.0-flash non-streaming calls
  • Cost: ~$0.003 per 500 tokens for gemini-2.0-flash
  • Rate limits: default tier 300 RPM / 60K TPM
  • Keep prompts concise to reduce token usage.
  • Use streaming to start processing output before full completion.
  • Cache frequent prompts and responses to avoid repeated calls.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call | ~800ms | ~$0.003/500 tokens | General purpose, simple integration |
| Streaming call | Starts immediately, total ~800ms | ~$0.003/500 tokens | Long responses, better UX |
| Async call | ~800ms | ~$0.003/500 tokens | Concurrent or event-driven apps |

Quick tip

Always initialize vertexai with your project and location before loading models to avoid authentication errors.

Common mistake

Forgetting to set the GOOGLE_APPLICATION_CREDENTIALS environment variable with the path to your service account JSON causes authentication failures.
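A quick guard that fails fast with a clear message before calling vertexai.init; the variable names are the ones from Setup, and check_credentials is a hypothetical helper:

```python
import os

def check_credentials():
    """Raise a clear error if required environment variables are missing."""
    required = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS")
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```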

Verified 2026-04 · gemini-2.0-flash, gemini-2.5-pro