Code beginner · 3 min read

How to use Gemini vision in Python

Direct answer
Use the vertexai Python SDK: initialize GenerativeModel with a Gemini vision-capable model ID, wrap your image bytes in a Part, and call generate_content.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
GOOGLE_CLOUD_PROJECT — your GCP project ID
GOOGLE_APPLICATION_CREDENTIALS — path to your service account JSON key
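Before calling vertexai.init it helps to fail fast if these variables are unset; a minimal sketch (the require_env helper is my own, not part of the SDK):

```python
import os

def require_env(names):
    """Return the named environment variables, raising a clear error if any are unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise EnvironmentError(f"Set these environment variables first: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# Example: check before calling vertexai.init (prints the error instead of crashing here)
try:
    require_env(["GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS"])
except EnvironmentError as err:
    print(err)
```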
Imports
python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

Examples

in: Image of a cat
out: A description of the cat in the image, including breed and posture.
in: Photo of a city skyline at sunset
out: A detailed caption describing the city skyline, time of day, and weather.
in: Blurry image with unclear objects
out: The image is blurry and objects are not clearly identifiable.

Integration steps

  1. Set up Google Cloud project and authenticate with service account JSON via environment variable.
  2. Install and import the vertexai SDK and initialize it with your project and location.
  3. Load the Gemini vision model using GenerativeModel with the appropriate model ID.
  4. Prepare the image input by loading it into memory and converting to bytes.
  5. Call generate_content with the image bytes as input to get the vision-based response.
  6. Extract and use the generated text output describing or analyzing the image.
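Step 4 needs a MIME type when the image bytes are wrapped in a Part; a small sketch using the standard library's mimetypes module (the image_mime_type helper name is my own):

```python
import mimetypes

def image_mime_type(path):
    """Guess the MIME type for an image file path, e.g. to pass to Part.from_data."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    return mime

print(image_mime_type("cat_photo.jpg"))  # image/jpeg
print(image_mime_type("skyline.png"))    # image/png
```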

Full code

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

# Initialize Vertex AI with your GCP project and location
vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")

# Load Gemini vision model
model = GenerativeModel("gemini-2.0-flash")

# Load an image file into bytes
image_path = "cat_photo.jpg"
with open(image_path, "rb") as f:
    image_bytes = f.read()

# Build a multimodal request: an image Part plus a text prompt
image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.2),
)

# Print the vision model's description of the image
print("Gemini Vision Output:", response.text)

API trace

Request
json
{"contents": [{"role": "user", "parts": [{"inline_data": {"mime_type": "image/jpeg", "data": "<base64-encoded-image-bytes>"}}, {"text": "Describe this image."}]}], "generation_config": {"temperature": 0.2}}
Response
json
{"candidates": [{"content": {"parts": [{"text": "This is a photo of a domestic short-haired cat sitting on a wooden floor..."}]}}], ...}
Extract: response.text
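The SDK produces the base64 payload for you, but if you build the REST request body yourself the encoding looks like the sketch below (the helper name is my own; exact field casing can differ by API surface, e.g. camelCase inlineData in raw REST):

```python
import base64
import json

def encode_image_for_request(image_bytes, mime_type="image/jpeg"):
    """Base64-encode raw image bytes into an inline-data part for the request body."""
    return {
        "inline_data": {
            "mime_type": mime_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    }

# A few stand-in bytes instead of a real JPEG, just to show the shape
payload = encode_image_for_request(b"\xff\xd8\xff")
print(json.dumps(payload))
```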

Variants

Streaming Gemini Vision Output

Use streaming when you want partial outputs as the model generates them for better UX with large or complex images.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

with open("city_skyline.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
stream_response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.3),
    stream=True,
)

# Print partial text as each chunk arrives
for chunk in stream_response:
    print(chunk.text, end="", flush=True)
Async Gemini Vision Call

Use async calls in applications that require concurrency or non-blocking behavior.

python
import asyncio
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

async def main():
    vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
    model = GenerativeModel("gemini-2.0-flash")
    with open("sunset.jpg", "rb") as f:
        image_bytes = f.read()
    image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
    response = await model.generate_content_async(
        [image_part, "Describe this image."],
        generation_config=GenerationConfig(temperature=0.2),
    )
    print("Async Gemini Vision Output:", response.text)

asyncio.run(main())
Use gemini-1.5-pro for More Detailed Vision

Choose gemini-1.5-pro when you want deeper multimodal reasoning and a larger context window; it is slower and more expensive per call than gemini-2.0-flash.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

with open("object.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.1),
)
print("Gemini 1.5 Pro Vision Output:", response.text)

Performance

Latency: ~1.2s per image for gemini-2.0-flash, non-streaming
Cost: ~$0.005 per 1,000 tokens plus image processing fees (check Google Cloud pricing)
Rate limits: Default Vertex AI quotas apply per project and region; check your project's quota page
  • Keep temperature low (0.1-0.3) for more deterministic outputs; to bound token usage, set max_output_tokens in GenerationConfig.
  • Keep images within the documented size limits (e.g., a few MB) to avoid extra processing delays.
  • Send multiple image Parts in a single generate_content call when one prompt covers them all, to save request overhead.
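The 4 MB figure above is this article's working limit (verify against current docs); a quick guard, with a hypothetical helper name, that fails before sending an oversized image:

```python
MAX_IMAGE_BYTES = 4 * 1024 * 1024  # 4 MB working limit used in this article

def check_image_size(image_bytes, limit=MAX_IMAGE_BYTES):
    """Raise before sending a request if the image exceeds the size limit."""
    if len(image_bytes) > limit:
        raise ValueError(
            f"Image is {len(image_bytes)} bytes; limit is {limit}. "
            "Downscale or recompress it first."
        )
    return image_bytes
```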
| Approach | Latency | Cost/call | Best for |
| --- | --- | --- | --- |
| gemini-2.0-flash (default) | ~1.2s | ~$0.005 | Fast, high-quality everyday vision |
| gemini-1.5-pro | Slower, higher per-call cost | Higher | Deeper multimodal reasoning, long context |
| Streaming output | ~1.2s to first chunk + incremental | ~$0.005 | Interactive apps needing partial results |
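When you hit the rate limits above, exponential backoff with jitter is the usual remedy. A sketch (the with_backoff helper is my own; it catches Exception broadly, where production code should catch the SDK's specific rate-limit error):

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter, for 429-style rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Usage sketch (model and image_part as in the full example above):
# response = with_backoff(lambda: model.generate_content([image_part, "Describe this image."]))
```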

Quick tip

Always read images as raw bytes and wrap them in Part.from_data (with the correct mime_type) before passing them to generate_content.

Common mistake

Passing a file path or URL directly to generate_content instead of raw image bytes (or a Part built from them) raises an error; read the file or download the URL first.
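One way to guard against this mistake is a loader that only ever hands raw bytes onward (a hypothetical helper, not part of the SDK):

```python
from pathlib import Path

def load_image_bytes(source):
    """Accept raw bytes or a local file path; reject URLs, which must be downloaded first."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    text = str(source)
    if text.startswith(("http://", "https://")):
        raise ValueError("Download the URL yourself and pass the raw bytes.")
    return Path(text).read_bytes()
```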

Verified 2026-04 · gemini-2.0-flash, gemini-1.5-pro