Vision language models explained
Quick answer
Vision language models are multimodal AI systems that process both visual inputs (images) and textual inputs to generate context-aware outputs. They combine computer vision and natural language processing capabilities, enabling tasks like image captioning, visual question answering, and multimodal content generation.

Prerequisites

- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable to access vision language models.
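To confirm the key is actually visible to Python before making any calls, here is a quick sanity check (the helper name is ours, not part of the SDK):

```python
import os

def check_api_key(env_var: str = "OPENAI_API_KEY") -> bool:
    """Return True if the given API-key environment variable is set and non-empty."""
    return bool(os.environ.get(env_var))

if not check_api_key():
    print("Set OPENAI_API_KEY before running the examples below.")
```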
pip install "openai>=1.0"

Step by step
Use the OpenAI SDK to send an image and a prompt to a vision language model like gpt-4o. Images are passed as structured content parts alongside the text prompt, and the model returns a text response based on the image content and your question.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example: describe the content of an image.
# Images must be sent as structured content parts, not pasted into the prompt text.
image_url = "https://example.com/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image:"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print(response.choices[0].message.content)

Output
A scenic mountain landscape with a clear blue sky and a river flowing through the valley.
Common variations
You can send local image files by encoding them as base64 data URLs, or switch to other vision-capable models such as gemini-2.5-pro via Google's OpenAI-compatible endpoint. Async calls and streaming outputs are also supported for interactive applications.
import os
import base64
from openai import OpenAI

# gemini-2.5-pro is served through Google's OpenAI-compatible endpoint;
# point the client there and use a Gemini API key.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# Read the local image and encode it as a base64 data URL.
with open("./image.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What objects are in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_b64}"},
            },
        ],
    }
]

response = client.chat.completions.create(
    model="gemini-2.5-pro",
    messages=messages,
)
print(response.choices[0].message.content)

Output
The image contains a dog playing with a ball in a grassy park.
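Building the data URL by hand is error-prone; a small helper (our own, not part of any SDK) that infers the MIME type from the file extension keeps it tidy:

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL for vision model requests."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Could not determine MIME type for {path}")
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"
```

The resulting string can be dropped directly into an `image_url` content part.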
Troubleshooting
- If the model does not recognize the image, ensure the image URL is publicly accessible or the base64 encoding is correct.
- Check your API key and model name for typos.
- For large images, resize or compress to meet API size limits.
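For the size-limit point above, a quick pre-flight check can catch oversized files before the API rejects them (the 20 MB threshold here is an assumption; check your provider's current limits):

```python
import os

def under_size_limit(path: str, max_mb: float = 20.0) -> bool:
    """Return True if the file at `path` is at or below the assumed upload limit."""
    return os.path.getsize(path) <= max_mb * 1024 * 1024
```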
Key Takeaways
- Vision language models combine image and text inputs for rich multimodal understanding.
- Use base64 encoding for local images or public URLs for remote images in API calls.
- Models like gpt-4o and gemini-2.5-pro support advanced vision-language tasks.
- Always verify API keys, model names, and image accessibility to avoid errors.