Code beginner · 3 min read

How to use Gemini on Vertex AI in Python

Direct answer
Use the Vertex AI SDK for Python (`google-cloud-aiplatform`): call vertexai.init() with your project and location, create a GenerativeModel for a Gemini model ID, and call its generate_content() method to get text responses.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
GOOGLE_APPLICATION_CREDENTIALS, PROJECT_ID, LOCATION
Imports
python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

Examples

In: Generate a short poem about spring.
Out: Spring whispers softly, blooms awake, colors dance in gentle breeze.
In: Explain the benefits of using Vertex AI with Gemini models.
Out: Vertex AI offers scalable, managed infrastructure with seamless Gemini model integration for powerful, low-latency AI applications.
In: Translate 'Hello, world!' to French.
Out: Bonjour, le monde !

Integration steps

  1. Set up Google Cloud authentication by pointing the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account JSON key (or run `gcloud auth application-default login`).
  2. Initialize the SDK with vertexai.init(project=..., location=...).
  3. Create a GenerativeModel instance with the Gemini model ID (e.g., "gemini-1.5-pro").
  4. Call the model's generate_content() method with your prompt text.
  5. Read the generated text from response.text (or inspect response.candidates for details).

Full code

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

# Set environment variables before running:
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
# export PROJECT_ID="your-gcp-project-id"
# export LOCATION="us-central1"

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
model_id = "gemini-1.5-pro"  # Example Gemini model

# Initialize the Vertex AI SDK for this project and region
vertexai.init(project=project_id, location=location)

# Create a handle to the Gemini model
model = GenerativeModel(model_id)

# Send a prompt and get a response
response = model.generate_content("Write a short poem about spring.")

# response.text concatenates the text parts of the top candidate
print("Generated text:", response.text)
output
Generated text: Spring whispers softly, blooms awake, colors dance in gentle breeze.

API trace

Request
json
{"model": "projects/{project_id}/locations/{location}/publishers/google/models/gemini-1.5-pro", "contents": [{"role": "user", "parts": [{"text": "Write a short poem about spring."}]}]}
Response
json
{"candidates": [{"content": {"role": "model", "parts": [{"text": "Spring whispers softly, blooms awake, colors dance in gentle breeze."}]}, "finishReason": "STOP"}]}
Extract: response.text (SDK) or candidates[0]['content']['parts'][0]['text'] (raw JSON)
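If you call the REST API directly instead of using the SDK, the generated text sits inside the candidates list of the parsed JSON. A minimal extraction sketch, using a hypothetical sample payload in the generateContent response shape (real responses carry extra metadata such as usage counts):

```python
# Hypothetical generateContent-style response, already parsed from JSON.
response_json = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {"text": "Spring whispers softly, blooms awake, "},
                    {"text": "colors dance in gentle breeze."},
                ],
            },
            "finishReason": "STOP",
        }
    ]
}

def extract_text(response: dict) -> str:
    """Concatenate the text parts of the top-ranked candidate."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)

print(extract_text(response_json))
```

Note that a single candidate may contain several parts, so joining them is safer than reading only the first.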

Variants

Streaming prediction with Gemini on Vertex AI

Use when you want output tokens as they are generated, for long responses or responsive UIs; pass stream=True to generate_content() and iterate over the returned chunks.

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-1.5-pro")

# stream=True returns an iterator of partial responses
responses = model.generate_content(
    "Tell me a story about a brave knight.", stream=True
)
for chunk in responses:
    print(chunk.text, end="", flush=True)
print()
Async prediction call with Gemini on Vertex AI

Use generate_content_async() for concurrent or non-blocking calls in applications that need high throughput or UI responsiveness.

python
import asyncio
import os

import vertexai
from vertexai.generative_models import GenerativeModel

async def async_predict():
    project_id = os.environ["PROJECT_ID"]
    location = os.environ["LOCATION"]

    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-1.5-pro")

    # Awaiting the call keeps the event loop free for other tasks
    response = await model.generate_content_async(
        "Explain quantum computing in simple terms."
    )
    print("Generated text:", response.text)

asyncio.run(async_predict())
Use Gemini-2.0-flash model for faster responses

Choose this model variant for lower latency and cost when you need faster but slightly less detailed responses.

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

vertexai.init(project=project_id, location=location)

# Flash models trade some depth for lower latency and cost
model = GenerativeModel("gemini-2.0-flash")

response = model.generate_content("Summarize the latest AI trends.")
print("Generated text:", response.text)

Performance

Latency: ~1-2 s per request for gemini-1.5-pro, non-streaming
Cost: roughly $0.003 per 1,000 generated tokens on gemini-1.5-pro (varies by region and model version; check current pricing)
Rate limits: per-model requests-per-minute quotas apply per project, subject to your Google Cloud quotas
  • Limit prompt length to reduce input tokens.
  • Set a lower max output tokens value when long answers are not needed.
  • Cache frequent prompts and responses to avoid repeated calls.
Approach | Latency | Cost/call | Best for
Standard generate_content() | ~1-2 s | ~$0.003/1k tokens | General-purpose text generation
Async generate_content_async() | ~1-2 s (non-blocking) | ~$0.003/1k tokens | Concurrent or UI apps
gemini-2.0-flash | ~0.5-1 s | ~$0.002/1k tokens | Faster, cost-sensitive use cases
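The caching tip above can be sketched with functools.lru_cache. Here generate_cached is a hypothetical stand-in for the real model call, and call_count only exists to make the cache's effect visible; this assumes your prompts repeat exactly, since the cache keys on the prompt string.

```python
import functools

call_count = 0  # counts how often the underlying "model" is invoked

@functools.lru_cache(maxsize=256)
def generate_cached(prompt: str) -> str:
    global call_count
    call_count += 1
    # In real code this would be: return model.generate_content(prompt).text
    return f"response to: {prompt}"

generate_cached("Summarize the latest AI trends.")
generate_cached("Summarize the latest AI trends.")  # served from cache
print(call_count)  # the underlying call runs only once
```

For non-deterministic generation, note that caching also freezes the answer; clear the cache (generate_cached.cache_clear()) when fresh responses matter.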

Quick tip

Always set the GOOGLE_APPLICATION_CREDENTIALS environment variable (or run `gcloud auth application-default login`) so your Vertex AI client can authenticate securely.

Common mistake

Passing only the model ID where the low-level API expects a full resource path (projects/{project}/locations/{location}/publishers/google/models/{model}) causes NOT_FOUND or permission errors; the high-level GenerativeModel accepts the short model ID directly.
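For illustration, this is how the full publisher-model path expected by the low-level clients is assembled; the project ID and location values here are placeholders:

```python
project_id = "your-gcp-project-id"  # placeholder
location = "us-central1"
model_id = "gemini-1.5-pro"

# Low-level clients need the full publisher model resource path,
# not just the short model ID.
model_path = (
    f"projects/{project_id}/locations/{location}"
    f"/publishers/google/models/{model_id}"
)
print(model_path)
```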

Verified 2026-04 · gemini-1.5-pro, gemini-2.0-flash