How to use GPT-4 Vision in Python
Direct answer
Use the OpenAI Python SDK with the gpt-4o model. Send a user message whose content is a list holding a text part and an image_url part, so the prompt and the image travel in the same request.
Setup
Install
pip install openai
Env vars
OPENAI_API_KEY
Imports
import os
from openai import OpenAI
Examples
in: Send an image of a cat and ask 'What animal is this?'
out: The image shows a cat.
in: Send a photo of a street sign and ask 'What does this sign say?'
out: The sign says 'No Parking Between 8 AM and 6 PM.'
in: Send a picture of a handwritten note and ask 'What is written here?'
out: The note says 'Meeting at 3 PM tomorrow.'
Integration steps
- Install the OpenAI Python SDK and set your API key in the environment variable OPENAI_API_KEY.
- Import the OpenAI client and initialize it with your API key from os.environ.
- Prepare the messages array: one user message whose content is a list with a text part (the prompt) and an image_url part (an image URL or a base64 data URL).
- Call the chat.completions.create method with model='gpt-4o' and the messages array.
- Extract the response text from response.choices[0].message.content and handle it as needed.
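The message-preparation step above can be sketched as a small helper. This is a minimal sketch that assumes the Chat Completions content-parts format, in which the prompt is a `text` part and the image is an `image_url` part of the same user message. The name `build_vision_messages` is hypothetical and not part of the SDK.

```python
from typing import Any


def build_vision_messages(prompt: str, image_ref: str) -> list[dict[str, Any]]:
    """Return a one-message list pairing a text prompt with an image.

    image_ref may be an https URL or a base64 data URL.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_ref}},
            ],
        }
    ]


# Hypothetical usage: the result can be passed as messages=... to the API call
messages = build_vision_messages(
    "What is shown in this image?",
    "https://example.com/path/to/image.jpg",
)
```

Keeping the payload construction in one function makes it easy to check the message shape without making a network call.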
Full code
import os
from openai import OpenAI

# Read the API key from the environment
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Example image URL (replace with your own image URL or a base64 data URL)
image_url = "https://example.com/path/to/image.jpg"

# The prompt and the image are separate parts of one user message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print("Response:", response.choices[0].message.content)
Output
Response: The image shows a scenic mountain landscape with a lake in the foreground.
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": [{"type": "text", "text": "What is shown in this image?"}, {"type": "image_url", "image_url": {"url": "https://example.com/path/to/image.jpg"}}]}]}
Response
{"choices": [{"message": {"content": "The image shows a scenic mountain landscape with a lake in the foreground."}}], "usage": {"total_tokens": 75}}
Extract
response.choices[0].message.content
Variants
Streaming GPT-4 Vision Response
Use streaming to display partial results immediately for better user experience with long image descriptions.
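import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
image_url = "https://example.com/path/to/image.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
]
# stream=True returns an iterator of chunks instead of one response
response_stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)
for chunk in response_stream:
    # Some chunks carry no choices or no text; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Async GPT-4 Vision Call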
Use async calls to handle multiple concurrent GPT-4 Vision requests efficiently.
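For several images at once, one common pattern is to cap how many requests run concurrently so the rate limit is not exhausted. The sketch below illustrates this with `asyncio.Semaphore` and `asyncio.gather`. `describe_many` and `fake_describe` are hypothetical names, and `fake_describe` stands in for an awaited vision request so the example runs without an API key.

```python
import asyncio
from typing import Awaitable, Callable


async def describe_many(
    describe_one: Callable[[str], Awaitable[str]],
    image_urls: list[str],
    max_concurrency: int = 4,
) -> list[str]:
    """Run describe_one over every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:
            return await describe_one(url)

    # gather returns results in the same order as image_urls
    return await asyncio.gather(*(guarded(url) for url in image_urls))


async def fake_describe(url: str) -> str:
    # Hypothetical stand-in for an awaited vision request
    await asyncio.sleep(0.01)
    return f"description of {url}"


results = asyncio.run(describe_many(fake_describe, ["a.jpg", "b.jpg", "c.jpg"]))
print(results)
```

In real use, `describe_one` would wrap the awaited API call and return the extracted response text.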
import os
import asyncio
from openai import AsyncOpenAI

async def main():
    # AsyncOpenAI is the awaitable counterpart of the OpenAI client
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    image_url = "https://example.com/path/to/image.jpg"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
    # The async client exposes the same create method, awaited
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
    )
    print("Async Response:", response.choices[0].message.content)

asyncio.run(main())
Using Base64 Image Data Instead of URL
Use base64 image data when you cannot host the image publicly or want to send local images directly.
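Base64 image data is typically sent as a data URL, which combines a MIME type with the encoded bytes. The sketch below builds one from raw bytes. `to_data_url` is a hypothetical helper name, and inferring the MIME type from the file extension is an assumption made for illustration.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL."""
    # Guess the MIME type from the file extension, e.g. .jpg -> image/jpeg
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first three bytes of a JPEG file, used here as a tiny sample payload
data_url = to_data_url(b"\xff\xd8\xff", "photo.jpg")
print(data_url)
```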
import os
import base64
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Load the image and encode it as base64
with open("./image.jpg", "rb") as img_file:
    b64_image = base64.b64encode(img_file.read()).decode("utf-8")

# Base64 data is sent as a data URL in the image_url part
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
print("Response:", response.choices[0].message.content)
Performance
Latency: ~1.5s for typical GPT-4 Vision image-text queries
Cost: ~$0.03 per 1,000 tokens, including image processing
Rate limits: Tier 1: 300 RPM / 15K TPM for GPT-4 Vision
- Keep prompts concise to reduce token usage.
- Avoid sending very large images; resize before encoding.
- Batch multiple questions in one request to save overhead.
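The batching tip above can be sketched as a helper that folds several questions into one numbered prompt, so a single request covers all of them. `batch_questions` is a hypothetical name, and the wording of the header line is only an illustration.

```python
def batch_questions(questions: list[str]) -> str:
    """Combine several questions about one image into a single numbered prompt."""
    if not questions:
        raise ValueError("At least one question is required")
    numbered = [f"{i}. {q.strip()}" for i, q in enumerate(questions, start=1)]
    header = "Answer each question about the image, numbering your answers to match:"
    return "\n".join([header, *numbered])


prompt = batch_questions(["What animal is this?", "What color is it?"])
print(prompt)
```

The resulting string would serve as the text prompt that accompanies the image in one request.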
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard GPT-4 Vision call | ~1.5s | ~$0.03 | Simple image + text queries |
| Streaming GPT-4 Vision | Starts ~0.5s, streams | ~$0.03 | Long descriptive outputs |
| Async GPT-4 Vision | ~1.5s concurrent | ~$0.03 | High concurrency scenarios |
Quick tip
Always send the image as an image_url part inside the user message's content list. A top-level 'image' field is not part of the Chat Completions message format.
Common mistake
Leaving the image_url part out of the message's content list means the model never receives the image and can only answer from the text prompt.
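A quick pre-flight check can catch this mistake before a request is sent. The sketch below assumes the content-parts message format, in which an image appears as a part with `"type": "image_url"`. `has_image_part` is a hypothetical helper, not an SDK function.

```python
def has_image_part(messages: list[dict]) -> bool:
    """Return True if any message carries an image_url content part."""
    for message in messages:
        content = message.get("content")
        # Plain-string content cannot hold an image part
        if not isinstance(content, list):
            continue
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                return True
    return False


text_only = [{"role": "user", "content": "What is in this image?"}]
with_image = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/path/to/image.jpg"}},
        ],
    }
]
print(has_image_part(text_only), has_image_part(with_image))  # prints "False True"
```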