Code beginner · 3 min read

How to use GPT-4o vision in Python

Direct answer
Send a chat completion request with the OpenAI Python SDK, using model gpt-4o and a messages array whose user message contains both text and image parts. GPT-4o reads the image alongside the text and answers in a single response.

Setup

Install
bash
pip install openai
Env vars
OPENAI_API_KEY
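On macOS/Linux you can set the key for the current shell session (the sk-... value below is a placeholder for your real key):

```shell
# Export the API key so the SDK can read it from the environment
export OPENAI_API_KEY="sk-your-key-here"
```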
Imports
python
import os
from openai import OpenAI

Examples

In: Analyze the content of this image and describe it.
Out: The image shows a scenic mountain landscape with a clear blue sky and a lake in the foreground.
In: What objects are in this photo?
Out: The photo contains a dog playing with a ball in a grassy park.
In: Is there any text in this image? If yes, transcribe it.
Out: Yes, the image contains the text 'Welcome to AI Conference 2026'.

Integration steps

  1. Install the OpenAI Python SDK and set your API key in the environment variable OPENAI_API_KEY.
  2. Import OpenAI from the openai package and initialize the client with your API key.
  3. Prepare the messages list including a user message with text and an image URL or base64-encoded image data.
  4. Call client.chat.completions.create with model='gpt-4o' and the prepared messages.
  5. Extract the response text from response.choices[0].message.content to get the model's interpretation of the image.
  6. Handle or display the multimodal output as needed in your application.
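Step 3 mentions base64-encoded image data as an alternative to a URL. A minimal sketch of building such a message, using stand-in bytes so the snippet runs as-is (reading a local file like photo.jpg, shown commented out, is an assumption about your setup):

```python
import base64

# In practice you would read a real file:
# image_bytes = open("photo.jpg", "rb").read()
image_bytes = b"\xff\xd8\xff\xe0stand-in-jpeg-bytes"  # placeholder, not a real image

# Base64-encode the bytes and wrap them in a data URL
b64 = base64.b64encode(image_bytes).decode("utf-8")
data_url = f"data:image/jpeg;base64,{b64}"

# Same message shape as the URL example; only the url value changes
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the content of this image."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
]
```

The resulting messages list is passed to client.chat.completions.create exactly as in the full code below; the API accepts either an HTTPS URL or a data URL in the same image_url field.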

Full code

python
import os
from openai import OpenAI

# Initialize client with API key from environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example image URL to analyze
image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the content of this image."},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
)

print("Model response:")
print(response.choices[0].message.content)

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": [{"type": "text", "text": "Describe the content of this image."}, {"type": "image_url", "image_url": {"url": "https://images.unsplash.com/photo-1506744038136-46273834b3fb"}}]}]}
Response
json
{"choices": [{"message": {"content": "The image depicts a beautiful mountain landscape with a clear blue sky..."}}], "usage": {"total_tokens": 150}}
Extract: response.choices[0].message.content

Variants

Streaming GPT-4o Vision Response
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the content of this image."},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }
]

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True
)

print("Streaming response:")
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
Async GPT-4o Vision Call
python
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI exposes the same interface as OpenAI, with awaitable methods
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the content of this image."},
                {"type": "image_url", "image_url": {"url": image_url}}
            ]
        }
    ]
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    print("Async response:")
    print(response.choices[0].message.content)

asyncio.run(main())
Use GPT-4o-mini Vision Model
python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

image_url = "https://images.unsplash.com/photo-1506744038136-46273834b3fb"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this image."},
            {"type": "image_url", "image_url": {"url": image_url}}
        ]
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

print("Mini model response:")
print(response.choices[0].message.content)

Performance

Latency: ~1.2 seconds per request for typical image + text input on gpt-4o
Cost: ~$0.003 per 1K tokens plus a small surcharge for image input on gpt-4o
Rate limits: Tier 1: 300 RPM / 18K TPM for gpt-4o
  • Compress or resize images before encoding to reduce payload size.
  • Use concise prompts to minimize token usage.
  • Cache frequent image analyses to avoid repeated calls.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard gpt-4o vision call | ~1.2s | ~$0.003 per 1K tokens + image | High-quality multimodal understanding |
| Streaming response | ~1.2s start + incremental | Same as standard | Interactive apps needing token-by-token output |
| Async call | ~1.2s concurrent | Same as standard | Concurrent or async frameworks |
| gpt-4o-mini vision | ~0.6s | ~$0.001 per 1K tokens + image | Cost-sensitive or lower-latency use cases |
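The caching tip above can be sketched with a small in-memory cache keyed on the image bytes and prompt. Note that analyze_image and call_model are illustrative names, not part of the SDK; call_model would wrap the client.chat.completions.create call shown earlier:

```python
import hashlib

_cache = {}

def analyze_image(image_bytes, prompt, call_model):
    """Return a cached result when the same image + prompt was seen before."""
    # Hash image bytes together with the prompt to form a stable cache key
    key = hashlib.sha256(image_bytes + prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        # Only pay for an API call on a cache miss
        _cache[key] = call_model(image_bytes, prompt)
    return _cache[key]

# Demo with a counting stub standing in for a real API call
calls = {"n": 0}
def fake_model(image_bytes, prompt):
    calls["n"] += 1
    return "description"

analyze_image(b"img", "Describe", fake_model)
analyze_image(b"img", "Describe", fake_model)  # second call served from cache
```

For production use you would bound the cache size (for example with an LRU policy) so repeated-but-varied traffic does not grow memory without limit.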

Quick tip

Always include the image as a structured part in the <code>messages</code> array with <code>"type": "image_url"</code>. The <code>url</code> value can be either an HTTPS link or a base64 data URL (<code>data:image/jpeg;base64,...</code>); there is no separate base64 content type.

Common mistake

Beginners often send images as plain text URLs instead of embedding them as structured message content with the correct <code>type</code> field, causing the model to ignore the image.
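A minimal illustration of the difference (example.com is a placeholder URL): in the broken version the URL is ordinary text, so no image is ever sent; in the correct version it travels as a typed image_url part:

```python
# Broken: the URL is just part of the prompt text, so the model sees no image
wrong = [{"role": "user",
          "content": "Describe this image: https://example.com/photo.jpg"}]

# Correct: the image is a structured content part with its own type field
right = [{"role": "user",
          "content": [
              {"type": "text", "text": "Describe this image."},
              {"type": "image_url",
               "image_url": {"url": "https://example.com/photo.jpg"}},
          ]}]
```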

Verified 2026-04 · gpt-4o, gpt-4o-mini