How to use Gemini vision in Python
Direct answer
Use the Vertex AI Python SDK: initialize GenerativeModel with a Gemini model ID, then call generate_content with image and text parts.
Setup
Install
pip install google-cloud-aiplatform
Env vars
GOOGLE_CLOUD_PROJECT
GOOGLE_APPLICATION_CREDENTIALS
Imports
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part
Examples
In: Image of a cat
Out: A description of the cat in the image, including breed and posture.
In: Photo of a city skyline at sunset
Out: A detailed caption describing the city skyline, time of day, and weather.
In: Blurry image with unclear objects
Out: The image is blurry and objects are not clearly identifiable.
Integration steps
- Set up Google Cloud project and authenticate with service account JSON via environment variable.
- Install and import the vertexai SDK and initialize it with your project and location.
- Load the Gemini vision model using GenerativeModel with the appropriate model ID.
- Prepare the image input by reading the file into raw bytes and wrapping them in a Part with the correct MIME type.
- Call generate_content with the image part and a text prompt to get the vision-based response.
- Extract and use the generated text output describing or analyzing the image.
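Step 1 can be made fail-fast. The snippet below is a minimal sketch (the helper name is my own) that checks the required environment variables are set before any SDK call:

```python
import os

# Variables the Vertex AI setup above relies on
REQUIRED_ENV_VARS = ["GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS"]

def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]

# Fail fast before calling vertexai.init:
# missing = missing_env_vars()
# if missing:
#     raise RuntimeError(f"Set these environment variables first: {missing}")
```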
Full code
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

# Initialize Vertex AI with your GCP project and location
vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")

# Load a Gemini model (Gemini models accept mixed image and text input)
model = GenerativeModel("gemini-2.0-flash")

# Read an image file into raw bytes
image_path = "cat_photo.jpg"
with open(image_path, "rb") as f:
    image_bytes = f.read()

# Wrap the bytes in a Part and pair it with a text prompt
image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.2),
)

# Print the model's description of the image
print("Gemini Vision Output:", response.text)
API trace
Request
{"contents": [{"role": "user", "parts": [{"inline_data": {"mime_type": "image/jpeg", "data": "<base64-encoded-image-bytes>"}}, {"text": "Describe this image."}]}], "generation_config": {"temperature": 0.2}}
Response
{"candidates": [{"content": {"parts": [{"text": "This is a photo of a domestic short-haired cat sitting on a wooden floor..."}]}}], "usage_metadata": {...}}
Extract
response.text
Variants
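Note that response.text raises when the model returns no usable candidate (for example, when output is blocked by safety filters). The generateContent response nests text under candidates, then content, then parts; a defensive extractor over that raw JSON shape (the helper name is hypothetical) might look like:

```python
def extract_text(response_json):
    """Return the first candidate's concatenated text, or None if the
    response carries no usable text (e.g. it was blocked or empty)."""
    for candidate in response_json.get("candidates", []):
        parts = candidate.get("content", {}).get("parts", [])
        text = "".join(p.get("text", "") for p in parts)
        if text:
            return text
    return None
```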
Streaming Gemini Vision Output
Use streaming when you want partial output as the model generates it; this gives better UX with large or complex images.
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-2.0-flash")

with open("city_skyline.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
stream_response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.3),
    stream=True,
)
for chunk in stream_response:
    print(chunk.text, end="", flush=True)
Async Gemini Vision Call
Use async calls in applications that require concurrency or non-blocking behavior.
import os
import asyncio
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

async def main():
    vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
    model = GenerativeModel("gemini-2.0-flash")
    with open("sunset.jpg", "rb") as f:
        image_bytes = f.read()
    image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
    response = await model.generate_content_async(
        [image_part, "Describe this image."],
        generation_config=GenerationConfig(temperature=0.2),
    )
    print("Async Gemini Vision Output:", response.text)

asyncio.run(main())
Use Gemini 1.5 Pro for More Detailed Vision
Choose Gemini 1.5 Pro when you need more detailed image understanding and can accept higher latency and cost; Gemini 2.0 Flash remains the faster, cheaper default.
import os
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Part

vertexai.init(project=os.environ["GOOGLE_CLOUD_PROJECT"], location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

with open("object.jpg", "rb") as f:
    image_bytes = f.read()

image_part = Part.from_data(data=image_bytes, mime_type="image/jpeg")
response = model.generate_content(
    [image_part, "Describe this image."],
    generation_config=GenerationConfig(temperature=0.1),
)
print("Gemini 1.5 Pro Vision Output:", response.text)
Performance
Latency: ~1.2s per image for gemini-2.0-flash, non-streaming
Cost: ~$0.005 per 1,000 tokens plus image processing fees (check Google Cloud pricing)
Rate limits: default Vertex AI quotas apply, typically 60 RPM per project
- Keep temperature low (0.1-0.3) for more deterministic, concise outputs; cap response length with max_output_tokens to control token usage.
- Limit image size to recommended max (e.g., 4MB) to avoid extra processing delays.
- Batch multiple images in one request if supported to save overhead.
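Given the ~60 RPM default quota noted above, simple client-side pacing helps avoid 429 errors. A minimal sketch (the class name is my own; time is passed in explicitly so the logic is testable):

```python
class RequestPacer:
    """Spaces requests so they stay under a requests-per-minute quota."""

    def __init__(self, rpm):
        self.min_interval = 60.0 / rpm  # seconds between requests
        self._next_allowed = 0.0

    def delay(self, now):
        """Seconds the caller should sleep before sending a request at `now`."""
        wait = max(0.0, self._next_allowed - now)
        self._next_allowed = now + wait + self.min_interval
        return wait
```

In real code you would call time.sleep(pacer.delay(time.monotonic())) before each generate_content call.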
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Gemini 2.0 Flash (default) | ~1.2s | ~$0.005 | Fast, cost-effective vision |
| Gemini 1.5 Pro | higher than Flash | higher (see Google Cloud pricing) | More detailed image understanding |
| Streaming output | same total, first chunks sooner | ~$0.005 | Interactive apps needing partial results |
Quick tip
Always pass images as raw bytes wrapped in a Part (via Part.from_data with the correct mime_type), together with a text prompt, in the list passed to generate_content.
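To make the byte-encoding concrete, this sketch assembles the JSON body shape a generateContent call sends over the wire (inline image data is base64-encoded; the helper name is my own, and no request is actually made):

```python
import base64

def build_request_body(image_bytes, prompt, mime_type="image/jpeg", temperature=0.2):
    """Assemble a generateContent request body with one inline image and a prompt."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inline_data": {
                    "mime_type": mime_type,
                    # raw bytes must be base64-encoded for the JSON body
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }],
        "generation_config": {"temperature": temperature},
    }
```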
Common mistake
Passing a local file path (or arbitrary URL) as if it were image content, instead of reading the file and sending its raw bytes, causes errors; for images stored in Cloud Storage, use Part.from_uri with a gs:// URI.
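Since Part.from_data requires a mime_type alongside the raw bytes, a small helper can derive it from the filename. A sketch using only the standard library (the function name is my own):

```python
import mimetypes

def guess_image_mime(path):
    """Guess an image MIME type from a file name; reject non-image files."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognizable image file: {path}")
    return mime
```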