Code beginner · 3 min read

How to use Gemini on Vertex AI in Python

Direct answer
Use the Vertex AI SDK for Python (`google-cloud-aiplatform`): call vertexai.init() with your project and location, create a GenerativeModel for a Gemini model ID, and call its generate_content() method to get text responses.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
GOOGLE_APPLICATION_CREDENTIALS, PROJECT_ID, LOCATION
Imports
python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

Examples

In: Generate a short poem about spring.
Out: Spring whispers softly, blooms awake, colors dance in gentle breeze.
In: Explain the benefits of using Vertex AI with Gemini models.
Out: Vertex AI offers scalable, managed infrastructure with seamless Gemini model integration for powerful, low-latency AI applications.
In: Translate 'Hello, world!' to French.
Out: Bonjour, le monde !

Integration steps

  1. Set up Google Cloud authentication by pointing the GOOGLE_APPLICATION_CREDENTIALS environment variable at a service account JSON key (or run `gcloud auth application-default login`).
  2. Initialize the SDK with vertexai.init(project=..., location=...).
  3. Create a GenerativeModel instance with the Gemini model ID (e.g., "gemini-1.5-pro").
  4. Call the model's generate_content() method with your prompt text.
  5. Read the generated text from response.text (or inspect response.candidates for details).

Full code

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

# Set environment variables before running:
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
# export PROJECT_ID="your-gcp-project-id"
# export LOCATION="us-central1"

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]
model_id = "gemini-1.5-pro"  # Example Gemini model

# Initialize the Vertex AI SDK for this project and region
vertexai.init(project=project_id, location=location)

# Create a handle to the Gemini model
model = GenerativeModel(model_id)

# Send a prompt and get a response
response = model.generate_content("Write a short poem about spring.")

# response.text concatenates the text parts of the top candidate
print("Generated text:", response.text)
output
Generated text: Spring whispers softly, blooms awake, colors dance in gentle breeze.

API trace

Request
json
{"model": "projects/{project_id}/locations/{location}/publishers/google/models/gemini-1.5-pro", "contents": [{"role": "user", "parts": [{"text": "Write a short poem about spring."}]}]}
Response
json
{"candidates": [{"content": {"role": "model", "parts": [{"text": "Spring whispers softly, blooms awake, colors dance in gentle breeze."}]}, "finishReason": "STOP"}]}
Extract: response.text (SDK) or candidates[0]['content']['parts'][0]['text'] (raw JSON)
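If you call the REST API directly instead of using the SDK, the generated text sits inside the candidates list of the parsed JSON. A minimal extraction sketch, using a hypothetical sample payload in the generateContent response shape (real responses carry extra metadata such as usage counts):

```python
# Hypothetical generateContent-style response, already parsed from JSON.
response_json = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [
                    {"text": "Spring whispers softly, blooms awake, "},
                    {"text": "colors dance in gentle breeze."},
                ],
            },
            "finishReason": "STOP",
        }
    ]
}

def extract_text(response: dict) -> str:
    """Concatenate the text parts of the top-ranked candidate."""
    parts = response["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)

print(extract_text(response_json))
```

Note that a single candidate may contain several parts, so joining them is safer than reading only the first.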

Variants

Streaming prediction with Gemini on Vertex AI

Use when you want output tokens as they are generated, for long responses or responsive UIs; pass stream=True to generate_content() and iterate over the returned chunks.

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

vertexai.init(project=project_id, location=location)
model = GenerativeModel("gemini-1.5-pro")

# stream=True returns an iterator of partial responses
responses = model.generate_content(
    "Tell me a story about a brave knight.", stream=True
)
for chunk in responses:
    print(chunk.text, end="", flush=True)
print()
Async prediction call with Gemini on Vertex AI

Use generate_content_async() for concurrent or non-blocking calls in applications that need high throughput or UI responsiveness.

python
import asyncio
import os

import vertexai
from vertexai.generative_models import GenerativeModel

async def async_predict():
    project_id = os.environ["PROJECT_ID"]
    location = os.environ["LOCATION"]

    vertexai.init(project=project_id, location=location)
    model = GenerativeModel("gemini-1.5-pro")

    # Awaiting the call keeps the event loop free for other tasks
    response = await model.generate_content_async(
        "Explain quantum computing in simple terms."
    )
    print("Generated text:", response.text)

asyncio.run(async_predict())
Use Gemini-2.0-flash model for faster responses

Choose this model variant for lower latency and cost when you need faster but slightly less detailed responses.

python
import os

import vertexai
from vertexai.generative_models import GenerativeModel

project_id = os.environ["PROJECT_ID"]
location = os.environ["LOCATION"]

vertexai.init(project=project_id, location=location)

# Flash models trade some depth for lower latency and cost
model = GenerativeModel("gemini-2.0-flash")

response = model.generate_content("Summarize the latest AI trends.")
print("Generated text:", response.text)

Performance

Latency: ~1-2 s per request for gemini-1.5-pro, non-streaming
Cost: roughly $0.003 per 1,000 generated tokens on gemini-1.5-pro (varies by region and model version; check current pricing)
Rate limits: per-model requests-per-minute quotas apply per project, subject to your Google Cloud quotas
  • Limit prompt length to reduce input tokens.
  • Set a lower max output tokens value when long answers are not needed.
  • Cache frequent prompts and responses to avoid repeated calls.
Approach | Latency | Cost/call | Best for
Standard generate_content() | ~1-2 s | ~$0.003/1k tokens | General-purpose text generation
Async generate_content_async() | ~1-2 s (non-blocking) | ~$0.003/1k tokens | Concurrent or UI apps
gemini-2.0-flash | ~0.5-1 s | ~$0.002/1k tokens | Faster, cost-sensitive use cases
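The caching tip above can be sketched with functools.lru_cache. Here generate_cached is a hypothetical stand-in for the real model call, and call_count only exists to make the cache's effect visible; this assumes your prompts repeat exactly, since the cache keys on the prompt string.

```python
import functools

call_count = 0  # counts how often the underlying "model" is invoked

@functools.lru_cache(maxsize=256)
def generate_cached(prompt: str) -> str:
    global call_count
    call_count += 1
    # In real code this would be: return model.generate_content(prompt).text
    return f"response to: {prompt}"

generate_cached("Summarize the latest AI trends.")
generate_cached("Summarize the latest AI trends.")  # served from cache
print(call_count)  # the underlying call runs only once
```

For non-deterministic generation, note that caching also freezes the answer; clear the cache (generate_cached.cache_clear()) when fresh responses matter.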

Quick tip

Always set the GOOGLE_APPLICATION_CREDENTIALS environment variable (or run `gcloud auth application-default login`) so your Vertex AI client can authenticate securely.

Common mistake

Passing only the model ID where the low-level API expects a full resource path (projects/{project}/locations/{location}/publishers/google/models/{model}) causes NOT_FOUND or permission errors; the high-level GenerativeModel accepts the short model ID directly.
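For illustration, this is how the full publisher-model path expected by the low-level clients is assembled; the project ID and location values here are placeholders:

```python
project_id = "your-gcp-project-id"  # placeholder
location = "us-central1"
model_id = "gemini-1.5-pro"

# Low-level clients need the full publisher model resource path,
# not just the short model ID.
model_path = (
    f"projects/{project_id}/locations/{location}"
    f"/publishers/google/models/{model_id}"
)
print(model_path)
```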

Verified 2026-04 · gemini-1.5-pro, gemini-2.0-flash