How to use GPT-4 Vision in Python
Direct answer
Use the OpenAI Python SDK with the gpt-4o model, sending image content parts alongside text in a single user message, to access GPT-4 Vision capabilities.
Setup
Install
pip install openai
Env vars
OPENAI_API_KEY
Imports
import os
from openai import OpenAI
Examples
In: Send an image of a cat and ask 'What animal is this?'
Out: The image shows a cat.
In: Send a photo of a street sign and ask 'What does this sign say?'
Out: The sign says 'No Parking Between 8 AM and 6 PM.'
In: Send a picture of a handwritten note and ask 'What is written here?'
Out: The note says 'Meeting at 3 PM tomorrow.'
Integration steps
- Install the OpenAI Python SDK and set your API key in the environment variable OPENAI_API_KEY.
- Import the OpenAI client and initialize it with your API key from os.environ.
- Prepare the messages array: a user message whose content is a list containing a text part with your prompt and an image_url part with the image URL or a base64 data URL.
- Call the chat.completions.create method with model='gpt-4o' and the messages array.
- Extract the response text from response.choices[0].message.content and handle it as needed.
Full code
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example image URL (replace with your own image URL or base64 data)
image_url = "https://example.com/path/to/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages
)
print("Response:", response.choices[0].message.content) output
Response: The image shows a scenic mountain landscape with a lake in the foreground.
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": [{"type": "text", "text": "What is shown in this image?"}, {"type": "image_url", "image_url": {"url": "https://example.com/path/to/image.jpg"}}]}]}
Response
{"choices": [{"message": {"content": "The image shows a scenic mountain landscape with a lake in the foreground."}}], "usage": {"total_tokens": 75}}
Extract
response.choices[0].message.content
Variants
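The extract path maps directly onto the raw trace JSON. As a sanity check, here is a minimal sketch, using only the standard library, that pulls the answer text and token count out of a response body shaped like the trace above:

```python
import json

# Raw response body, same shape as the API trace above
trace = ('{"choices": [{"message": {"content": '
         '"The image shows a scenic mountain landscape with a lake in the foreground."}}], '
         '"usage": {"total_tokens": 75}}')

data = json.loads(trace)
text = data["choices"][0]["message"]["content"]  # the model's answer
total_tokens = data["usage"]["total_tokens"]     # billing-relevant count
print(total_tokens)  # 75
```

In SDK code you get attribute access instead (response.choices[0].message.content); the dict form applies when you log or replay raw HTTP responses.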
Streaming GPT-4 Vision Response
Use streaming to display partial results immediately for better user experience with long image descriptions.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
image_url = "https://example.com/path/to/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]
response_stream = client.chat.completions.create(
model="gpt-4o",
messages=messages,
stream=True
)
for chunk in response_stream:
    # delta is an object, not a dict; content is None on some chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Async GPT-4 Vision Call
Use async calls to handle multiple concurrent GPT-4 Vision requests efficiently.
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    image_url = "https://example.com/path/to/image.jpg"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
    # AsyncOpenAI uses the same create method, awaited; there is no acreate
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    print("Async Response:", response.choices[0].message.content)

asyncio.run(main())
Using Base64 Image Data Instead of URL
Use base64 image data when you cannot host the image publicly or want to send local images directly.
import os
from openai import OpenAI
import base64
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Load image and encode as base64
with open("./image.jpg", "rb") as img_file:
b64_image = base64.b64encode(img_file.read()).decode("utf-8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Base64 images are sent as a data URL in an image_url part
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }
]
response = client.chat.completions.create(
model="gpt-4o",
messages=messages
)
print("Response:", response.choices[0].message.content) Performance
Latency: ~1.5s for typical GPT-4 Vision image-text queries
Cost: ~$0.03 per 1,000 tokens including image processing
Rate limits: Tier 1: 300 RPM / 15K TPM for GPT-4 Vision
- Keep prompts concise to reduce token usage.
- Avoid sending very large images; resize before encoding.
- Batch multiple questions in one request to save overhead.
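On the "avoid very large images" tip: a cheap pre-flight check is to measure the base64 payload before building a data URL, since base64 grows the raw bytes by about 4/3. A minimal stdlib-only sketch (the 20 MB cap is an assumption; check the current API limits):

```python
import base64

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumed per-image cap; verify against current docs

def encoded_size(raw: bytes) -> int:
    """Bytes occupied by the base64 encoding of raw image data (~4/3 growth)."""
    return len(base64.b64encode(raw))

def fits_limit(raw: bytes, limit: int = MAX_IMAGE_BYTES) -> bool:
    """Pre-flight check before embedding the image in a data URL."""
    return encoded_size(raw) <= limit

raw = b"\x00" * 3000  # stand-in for real image bytes
print(encoded_size(raw), fits_limit(raw))  # 4000 True
```

If the check fails, resize or recompress the image (e.g. with Pillow) before encoding rather than sending it as-is.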
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard GPT-4 Vision call | ~1.5s | ~$0.03 | Simple image + text queries |
| Streaming GPT-4 Vision | Starts ~0.5s, streams | ~$0.03 | Long descriptive outputs |
| Async GPT-4 Vision | ~1.5s concurrent | ~$0.03 | High concurrency scenarios |
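The batching tip above can be sketched as a single user message carrying one text part plus one image_url part per image, so several questions ride on one request. The URLs here are placeholders:

```python
def build_batched_message(prompt, image_urls):
    """One user message: a text part followed by one image_url part per image."""
    parts = [{"type": "text", "text": prompt}]
    parts.extend({"type": "image_url", "image_url": {"url": u}} for u in image_urls)
    return {"role": "user", "content": parts}

msg = build_batched_message(
    "For each image, what does the sign say?",
    ["https://example.com/sign1.jpg", "https://example.com/sign2.jpg"],  # placeholders
)
print(len(msg["content"]))  # 3: one text part + two image parts
```

Pass [msg] as the messages array to chat.completions.create as in the full code above; note that each image still contributes its own tokens to the request.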
Quick tip
Always include the image as an image_url content part inside the user message's content list to enable GPT-4 Vision multimodal input.
Common mistake
Putting the image outside the content list (for example, as a separate top-level 'image' field) means the model never sees it; the image must be an image_url part inside the message's content.
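This mistake is easy to catch before sending. A minimal sketch of a validator (the helper name and example URLs are illustrative, not part of any SDK):

```python
def has_image_part(message):
    """True if the message's content list carries at least one image_url part."""
    content = message.get("content")
    if not isinstance(content, list):
        return False  # plain string content cannot carry an image
    return any(isinstance(p, dict) and p.get("type") == "image_url" for p in content)

broken = {"role": "user", "content": "What is this?",
          "image": {"url": "https://example.com/cat.jpg"}}  # image is outside content
valid = {"role": "user", "content": [
    {"type": "text", "text": "What is this?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
]}
print(has_image_part(broken), has_image_part(valid))  # False True
```

Running such a check in tests or before each request turns a silent model failure into an immediate, debuggable assertion.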