Code beginner · 3 min read

How to use Gemini vision in Python

Direct answer
Use the vertexai Python SDK: initialize GenerativeModel with a Gemini vision-capable model ID, wrap your image bytes in a Part, and call generate_content.

Setup

Install
bash
pip install google-cloud-aiplatform
Env vars
GOOGLE_CLOUD_PROJECT — your GCP project ID
GOOGLE_APPLICATION_CREDENTIALS — path to your service account JSON key
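Before calling vertexai.init it helps to fail fast if these variables are unset; a minimal sketch (the require_env helper is my own, not part of the SDK):

```python
import os

def require_env(names):
    """Return the named environment variables, raising a clear error if any are unset."""
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise EnvironmentError(f"Set these environment variables first: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# Example: check before calling vertexai.init (prints the error instead of crashing here)
try:
    require_env(["GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS"])
except EnvironmentError as err:
    print(err)
```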
Imports
python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

Examples

in: Image of a cat
out: A description of the cat in the image, including breed and posture.
in: Photo of a city skyline at sunset
out: A detailed caption describing the city skyline, time of day, and weather.
in: Blurry image with unclear objects
out: The image is blurry and objects are not clearly identifiable.

Integration steps

  1. Set up Google Cloud project and authenticate with service account JSON via environment variable.
  2. Install and import the vertexai SDK and initialize it with your project and location.
  3. Load the Gemini vision model using GenerativeModel with the appropriate model ID.
  4. Prepare the image input by loading it into memory and converting to bytes.
  5. Call generate_content with the image bytes as input to get the vision-based response.
  6. Extract and use the generated text output describing or analyzing the image.
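Step 4 needs a MIME type when the image bytes are wrapped in a Part; a small sketch using the standard library's mimetypes module (the image_mime_type helper name is my own):

```python
import mimetypes

def image_mime_type(path):
    """Guess the MIME type for an image file path, e.g. to pass to Part.from_data."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    return mime

print(image_mime_type("cat_photo.jpg"))  # image/jpeg
print(image_mime_type("skyline.png"))    # image/png
```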

Full code

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

# Initialize Vertex AI with your GCP project and location
vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")

# Load Gemini vision model
model = GenerativeModel("gemini-2.0-flash")

# Load an image file into bytes
image_path = "cat_photo.jpg"
with open(image_path, "rb") as f:
    image_bytes = f.read()

# Build a multimodal request: an image Part plus a text prompt
image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.2),
)

# Print the vision model's description of the image
print("Gemini Vision Output:", response.text)

API trace

Request
json
{"contents": [{"role": "user", "parts": [{"inline_data": {"mime_type": "image/jpeg", "data": "<base64-encoded-image-bytes>"}}, {"text": "Describe this image."}]}], "generation_config": {"temperature": 0.2}}
Response
json
{"candidates": [{"content": {"parts": [{"text": "This is a photo of a domestic short-haired cat sitting on a wooden floor..."}]}}], ...}
Extract: response.text
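The SDK produces the base64 payload for you, but if you build the REST request body yourself the encoding looks like the sketch below (the helper name is my own; exact field casing can differ by API surface, e.g. camelCase inlineData in raw REST):

```python
import base64
import json

def encode_image_for_request(image_bytes, mime_type="image/jpeg"):
    """Base64-encode raw image bytes into an inline-data part for the request body."""
    return {
        "inline_data": {
            "mime_type": mime_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    }

# A few stand-in bytes instead of a real JPEG, just to show the shape
payload = encode_image_for_request(b"\xff\xd8\xff")
print(json.dumps(payload))
```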

Variants

Streaming Gemini Vision Output

Use streaming when you want partial outputs as the model generates them for better UX with large or complex images.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

with open("city_skyline.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
stream_response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.3),
    stream=True,
)

# Print partial text as each chunk arrives
for chunk in stream_response:
    print(chunk.text, end="", flush=True)
Async Gemini Vision Call

Use async calls in applications that require concurrency or non-blocking behavior.

python
import asyncio
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

async def main():
    vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
    model = GenerativeModel("gemini-2.0-flash")
    with open("sunset.jpg", "rb") as f:
        image_bytes = f.read()
    image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
    response = await model.generate_content_async(
        [image_part, "Describe this image."],
        generation_config=GenerationConfig(temperature=0.2),
    )
    print("Async Gemini Vision Output:", response.text)

asyncio.run(main())
Use gemini-1.5-pro for More Detailed Vision

Choose gemini-1.5-pro when you want deeper multimodal reasoning and a larger context window; it is slower and more expensive per call than gemini-2.0-flash.

python
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

with open("object.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.1),
)
print("Gemini 1.5 Pro Vision Output:", response.text)

Performance

Latency: ~1.2s per image for gemini-2.0-flash, non-streaming
Cost: ~$0.005 per 1,000 tokens plus image processing fees (check Google Cloud pricing)
Rate limits: Default Vertex AI quotas apply per project and region; check your project's quota page
  • Keep temperature low (0.1-0.3) for more deterministic outputs; to bound token usage, set max_output_tokens in GenerationConfig.
  • Keep images within the documented size limits (e.g., a few MB) to avoid extra processing delays.
  • Send multiple image Parts in a single generate_content call when one prompt covers them all, to save request overhead.
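The 4 MB figure above is this article's working limit (verify against current docs); a quick guard, with a hypothetical helper name, that fails before sending an oversized image:

```python
MAX_IMAGE_BYTES = 4 * 1024 * 1024  # 4 MB working limit used in this article

def check_image_size(image_bytes, limit=MAX_IMAGE_BYTES):
    """Raise before sending a request if the image exceeds the size limit."""
    if len(image_bytes) > limit:
        raise ValueError(
            f"Image is {len(image_bytes)} bytes; limit is {limit}. "
            "Downscale or recompress it first."
        )
    return image_bytes
```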
| Approach | Latency | Cost/call | Best for |
| --- | --- | --- | --- |
| gemini-2.0-flash (default) | ~1.2s | ~$0.005 | Fast, high-quality everyday vision |
| gemini-1.5-pro | Slower, higher per-call cost | Higher | Deeper multimodal reasoning, long context |
| Streaming output | ~1.2s to first chunk + incremental | ~$0.005 | Interactive apps needing partial results |
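When you hit the rate limits above, exponential backoff with jitter is the usual remedy. A sketch (the with_backoff helper is my own; it catches Exception broadly, where production code should catch the SDK's specific rate-limit error):

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=1.0):
    """Retry `call` with exponential backoff plus jitter, for 429-style rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Usage sketch (model and image_part as in the full example above):
# response = with_backoff(lambda: model.generate_content([image_part, "Describe this image."]))
```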

Quick tip

Always read images as raw bytes and wrap them in Part.from_data (with the correct mime_type) before passing them to generate_content.

Common mistake

Passing a file path or URL directly to generate_content instead of raw image bytes (or a Part built from them) raises an error; read the file or download the URL first.
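One way to guard against this mistake is a loader that only ever hands raw bytes onward (a hypothetical helper, not part of the SDK):

```python
from pathlib import Path

def load_image_bytes(source):
    """Accept raw bytes or a local file path; reject URLs, which must be downloaded first."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    text = str(source)
    if text.startswith(("http://", "https://")):
        raise ValueError("Download the URL yourself and pass the raw bytes.")
    return Path(text).read_bytes()
```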

Verified 2026-04 · gemini-2.0-flash, gemini-1.5-pro