How to build an image QA app with Python
Direct answer
Use a multimodal model such as gpt-4o to build an image QA app in Python: base64-encode the image, send it together with the question via the chat.completions.create method of the OpenAI SDK, and read the answer from the response.
Setup
Install
pip install openai pillow
Env vars
OPENAI_API_KEY
Imports
import base64
import os
from openai import OpenAI
from PIL import Image
Examples
in: Image of a cat sitting on a sofa; question: What animal is in the image?
out: The image shows a cat sitting on a sofa.
in: Image of a street with cars; question: How many cars are visible?
out: There are three cars visible in the image.
in: Image of a handwritten note; question: What does the note say?
out: The note says 'Meeting at 3 PM tomorrow.'
Integration steps
- Load and, if needed, preprocess (e.g. resize) the image file with Pillow.
- Initialize the OpenAI client with the API key from environment variables.
- Read the image bytes, base64-encode them, and embed them as a data URL in an image_url content part alongside the question text.
- Call the chat.completions.create method with a multimodal model such as gpt-4o.
- Extract the answer from the response's choices[0].message.content field.
- Display or return the answer to the user.
Full code
import base64
import os
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Load image and base64-encode it (raw bytes are not JSON-serializable)
image_path = "example.jpg"  # Replace with your image path
with open(image_path, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Put the question and the image in one user message as content parts
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }
]

# Call multimodal chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

# Extract and print answer
answer = response.choices[0].message.content
print("Answer:", answer)
Output
Answer: The image shows a cat sitting on a sofa.
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": [{"type": "text", "text": "What is shown in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<base64 image data>"}}]}]}
Response
{"choices": [{"message": {"content": "The image shows a cat sitting on a sofa."}}], "usage": {"total_tokens": 150}}
Extract
response.choices[0].message.content
Variants
Streaming response ›
Use streaming to provide real-time partial answers for better user experience on long responses.
import base64
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
image_path = "example.jpg"
with open(image_path, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ],
    }
]
stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)
for chunk in stream:
    # Each chunk carries an incremental piece of the answer (may be None)
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
Async version ›
Use async calls when integrating into async web servers or concurrent applications.
import asyncio
import base64
import os
from openai import AsyncOpenAI  # the sync OpenAI client cannot be awaited

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    with open("example.jpg", "rb") as f:
        b64_image = base64.b64encode(f.read()).decode("utf-8")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
            ],
        }
    ]
    response = await client.chat.completions.create(model="gpt-4o", messages=messages)
    print("Answer:", response.choices[0].message.content)

asyncio.run(main())
Alternative model (Claude multimodal) ›
Use Anthropic Claude multimodal if you prefer Claude's style or need alternative model capabilities. This variant uses Anthropic's official SDK (pip install anthropic), whose Messages API takes images as base64 source blocks rather than image_url parts.
import base64
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
image_path = "example.jpg"
with open(image_path, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": b64_image}},
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
)
print("Answer:", response.content[0].text)
Performance
Latency: ~1.5 seconds for a typical image QA request on gpt-4o
Cost: ~$0.015 per 1,000 tokens plus image processing fees (check current pricing)
Rate limits: Tier 1: 300 requests per minute, 20,000 tokens per minute (limits vary by account tier; check your current values)
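When requests exceed these per-minute limits, the SDK raises a rate-limit error; retrying with exponential backoff is the usual remedy. A minimal, generic sketch (the retry count and delays here are illustrative choices, not official values):

```python
import random
import time

def with_backoff(fn, retry_on=Exception, max_retries=5, base_delay=0.5):
    """Call fn(), retrying on retry_on exceptions with exponential
    backoff plus a little jitter; re-raise after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```

For example: with_backoff(lambda: client.chat.completions.create(model="gpt-4o", messages=messages), retry_on=openai.RateLimitError).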
- Keep questions concise to reduce token usage.
- Avoid sending very large images; resize before sending.
- Cache frequent image queries to avoid repeated calls.
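The resize advice above can be sketched with Pillow; the 1024-pixel cap and JPEG quality below are illustrative choices, not API requirements:

```python
import base64
import io

from PIL import Image

def image_to_data_url(path: str, max_side: int = 1024) -> str:
    """Downscale an image so its longest side is at most max_side,
    then return it as a base64 JPEG data URL ready for an image_url part."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=85)
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"
```

The returned string drops straight into {"type": "image_url", "image_url": {"url": ...}}.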
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard call (gpt-4o) | ~1.5s | ~$0.015 | General image QA |
| Streaming response | Starts in ~0.5s | ~$0.015 | Interactive apps with long answers |
| Async call | ~1.5s | ~$0.015 | Concurrent or web server integration |
| Claude multimodal | ~1.7s | Check Anthropic pricing | Alternative style or compliance needs |
Quick tip
Always base64-encode the image and wrap it in the content-part shape your API expects: {"type": "image_url", ...} with a data URL for the OpenAI API, or {"type": "image", "source": {...}} for Anthropic.
Common mistake
Beginners often pass raw bytes (which are not JSON-serializable) or omit the content-part wrapper entirely, causing the request to fail or the model to ignore the image input.
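To make this concrete, here is a minimal contrast between a broken payload and a correct OpenAI-style one (the image bytes are a stand-in):

```python
import base64
import json

image_bytes = b"\xff\xd8\xff"  # stand-in for real JPEG bytes

# Wrong: raw bytes are not JSON-serializable, and this shape is not a
# valid content part, so the image never reaches the model.
broken = {"role": "user", "content": {"type": "image", "image": {"data": image_bytes}}}

# Right: base64-encode the bytes and send a list of content parts,
# pairing the question text with an image_url data URL.
b64 = base64.b64encode(image_bytes).decode("utf-8")
correct = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ],
}

json.dumps(correct)  # serializes cleanly; json.dumps(broken) raises TypeError
```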