Vision language models explained
Quick answer
Vision language models are multimodal AI systems that process both visual inputs (images) and textual inputs to generate context-aware outputs. They combine computer vision and natural language processing capabilities, enabling tasks like image captioning, visual question answering, and multimodal content generation.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable to access vision language models.
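To confirm the key is actually visible to Python before making any calls, here is a quick sanity check (the helper name is ours, not part of the SDK):

```python
import os

def check_api_key(env_var: str = "OPENAI_API_KEY") -> bool:
    """Return True if the given API-key environment variable is set and non-empty."""
    return bool(os.environ.get(env_var))

if not check_api_key():
    print("Set OPENAI_API_KEY before running the examples below.")
```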
pip install "openai>=1.0"

Step by step
Use the OpenAI SDK to send an image and a prompt to a vision language model like gpt-4o. Images are passed as structured content parts alongside the text prompt, and the model returns a text response based on the image content and your question.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example: describe the content of an image.
# Images must be sent as structured content parts, not pasted into the prompt text.
image_url = "https://example.com/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print(response.choices[0].message.content)

Output
A scenic mountain landscape with a clear blue sky and a river flowing through the valley.
Common variations
You can send local image files by encoding them as base64 data URLs, or switch to other vision-capable models such as gemini-2.5-pro via Google's OpenAI-compatible endpoint. Async calls and streaming outputs are also supported for interactive applications.
import os
import base64
from openai import OpenAI

# gemini-2.5-pro is served through Google's OpenAI-compatible endpoint;
# point the client there and use a Gemini API key.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# Read the local image and encode it as a base64 data URL.
with open("./image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"},
            },
        ],
    }
]

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=messages,
)
print(response.choices[0].message.content)

Output
The image contains a dog playing with a ball in a grassy park.
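Building the data URL by hand is error-prone; a small helper (our own, not part of any SDK) that infers the MIME type from the file extension keeps it tidy:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL for vision model requests."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Could not determine MIME type for {path}")
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"
```

The resulting string can be dropped directly into an `image_url` content part.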
Troubleshooting
- If the model does not recognize the image, ensure the image URL is publicly accessible or the base64 encoding is correct.
- Check your API key and model name for typos.
- For large images, resize or compress to meet API size limits.
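For the size-limit point above, a quick pre-flight check can catch oversized files before the API rejects them (the 20 MB threshold here is an assumption; check your provider's current limits):

```python
import os

def under_size_limit(path: str, max_mb: float = 20.0) -> bool:
    """Return True if the file at `path` is at or below the assumed upload limit."""
    return os.path.getsize(path) <= max_mb * 1024 * 1024
```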
Key Takeaways
- Vision language models combine image and text inputs for rich multimodal understanding.
- Use base64 encoding for local images or public URLs for remote images in API calls.
- Models like gpt-4o and gemini-2.5-pro support advanced vision-language tasks.
- Always verify API keys, model names, and image accessibility to avoid errors.