How-to · Beginner to Intermediate · 3 min read

Vision language models explained

Quick answer
Vision language models are multimodal AI systems that process both visual inputs (images) and textual inputs to generate context-aware outputs. They combine computer vision and natural language processing capabilities, enabling tasks like image captioning, visual question answering, and multimodal content generation.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0"

Setup

Install the openai Python package and set your API key as an environment variable to access vision language models.

bash
pip install "openai>=1.0"
export OPENAI_API_KEY="your-api-key"

Step by step

Use the OpenAI SDK to send an image and a prompt to a vision-capable model like gpt-4o. The image goes into the message content as an image_url part alongside your text prompt; the model returns a text response grounded in the image.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example: Describe the content of an image
image_url = "https://example.com/image.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print(response.choices[0].message.content)
output
A scenic mountain landscape with a clear blue sky and a river flowing through the valley.

Common variations

You can send local image files by encoding them as base64 data URLs, or point the same SDK at another provider's OpenAI-compatible endpoint to use models like gemini-2.5-pro. Async calls and streaming outputs are also supported for interactive applications.

python
import os
import base64
from openai import OpenAI

# gemini-2.5-pro is served through Google's OpenAI-compatible endpoint,
# so point the client at that base URL and use a Gemini API key.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# Read the local image and encode it as a base64 data URL
with open("./image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"},
            },
        ],
    }
]

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=messages,
)

print(response.choices[0].message.content)
output
The image contains a dog playing with a ball in a grassy park.
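For the streaming mention above, the idea can be sketched without a live API call. When you pass stream=True to chat.completions.create, the SDK yields chunks whose text arrives as incremental deltas; a helper like collect_stream below joins them. The simulated chunks here are stand-ins for real SDK objects, so this runs offline.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join the incremental text deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content  # None for non-text chunks
        if delta:
            parts.append(delta)
    return "".join(parts)

# With the real SDK you would iterate the streamed response directly:
#   stream = client.chat.completions.create(
#       model="gpt-4o", messages=messages, stream=True)
#   text = collect_stream(stream)

# Simulated chunks mimicking the SDK's delta shape:
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["A dog ", "plays ", "fetch."]
]
print(collect_stream(fake))  # A dog plays fetch.
```

In an interactive app you would print each delta as it arrives instead of collecting them, which is what makes responses feel instant.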

Troubleshooting

  • If the model does not recognize the image, ensure the image URL is publicly accessible or the base64 encoding is correct.
  • Check your API key and model name for typos.
  • For large images, resize or compress to meet API size limits.
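As a rough pre-flight check for the size-limit bullet, you can measure the base64-encoded payload before sending. The 20 MB cap below is an assumption for illustration; check your provider's documentation for the actual limit.

```python
import base64
import os

MAX_B64_BYTES = 20 * 1024 * 1024  # assumed cap; verify against provider docs

def payload_fits(path: str, limit: int = MAX_B64_BYTES) -> bool:
    """Return True if the base64-encoded image stays under the limit."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read())
    return len(encoded) <= limit

# Example with a tiny throwaway file:
with open("tiny.bin", "wb") as f:
    f.write(b"\x89PNG" + b"\x00" * 100)
print(payload_fits("tiny.bin"))  # True
os.remove("tiny.bin")
```

Base64 inflates data by about a third, so a file well under the raw limit can still exceed the encoded one; this check measures the encoded size, which is what the API actually receives.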

Key Takeaways

  • Vision language models combine image and text inputs for rich multimodal understanding.
  • Use base64 encoding for local images or public URLs for remote images in API calls.
  • Models like gpt-4o and gemini-2.5-pro support advanced vision-language tasks.
  • Always verify API keys, model names, and image accessibility to avoid errors.
Verified 2026-04 · gpt-4o, gemini-2.5-pro