
How to build an image QA app with Python

Direct answer
Use a multimodal model such as gpt-4o through the OpenAI SDK's chat.completions.create method: base64-encode the image, send it as an image_url content part alongside the question, and read the answer from the response.

Setup

Install
bash
pip install openai pillow
Env vars
OPENAI_API_KEY (plus ANTHROPIC_API_KEY if you use the Claude variant)
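The SDK reads the key from the environment; a typical shell setup (the key value below is a placeholder) looks like:

```shell
# Set before running the app; replace the placeholder with your real key
export OPENAI_API_KEY="sk-..."
```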
Imports
python
import base64
import io
import os
from openai import OpenAI
from PIL import Image  # optional, for resizing before upload

Examples

In: Image of a cat sitting on a sofa, question: "What animal is in the image?"
Out: The image shows a cat sitting on a sofa.
In: Image of a street with cars, question: "How many cars are visible?"
Out: There are three cars visible in the image.
In: Image of a handwritten note, question: "What does the note say?"
Out: The note says 'Meeting at 3 PM tomorrow.'

Integration steps

  1. Load and optionally preprocess the image in Python using Pillow (e.g. resize large files).
  2. Initialize the OpenAI client with the API key from environment variables.
  3. Base64-encode the image and include it in the message content as an image_url part (a data URL), alongside a text part carrying the question.
  4. Call the chat.completions.create method with a multimodal model such as gpt-4o.
  5. Extract the answer from the response's choices[0].message.content field.
  6. Display or return the answer to the user.
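Steps 1 and 3 can be sketched as a small helper. The function name encode_image and the 1024-pixel cap are illustrative choices, not part of any API:

```python
import base64
import io

from PIL import Image


def encode_image(path, max_side=1024):
    """Resize so the longest side is at most max_side, then base64-encode as JPEG."""
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```

Resizing before upload keeps request payloads small and reduces image-token cost without usually hurting answer quality.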

Full code

python
import base64
import os
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Load image and base64-encode it for the data URL
image_path = "example.jpg"  # Replace with your image path
with open(image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# One user message with a text part and an image_url part
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
        ],
    }
]

# Call multimodal chat completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

# Extract and print answer
answer = response.choices[0].message.content
print("Answer:", answer)
output
Answer: The image shows a cat sitting on a sofa.

API trace

Request
json
{"model": "gpt-4o", "messages": [{"role": "user", "content": [{"type": "text", "text": "What is shown in this image?"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<base64 image data>"}}]}]}
Response
json
{"choices": [{"message": {"content": "The image shows a cat sitting on a sofa."}}], "usage": {"total_tokens": 150}}
Extract: response.choices[0].message.content

Variants

Streaming response

Use streaming to provide real-time partial answers for better user experience on long responses.

python
import base64
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

image_path = "example.jpg"
with open(image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }
]

# stream=True yields chunks as tokens arrive
stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
print()
Async version

Use async calls when integrating into async web servers or concurrent applications.

python
import asyncio
import base64
import os
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    with open("example.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ]
    response = await client.chat.completions.create(model="gpt-4o", messages=messages)
    print("Answer:", response.choices[0].message.content)

asyncio.run(main())
Alternative model (Claude multimodal)

Use Anthropic's Claude (via the anthropic SDK, installed with pip install anthropic) if you prefer Claude's style or need an alternative provider. Note that Claude's Messages API uses a different image format: a base64 source block rather than a data URL.

python
import base64
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

image_path = "example.jpg"
with open(image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "What is in this image?"},
            ],
        }
    ],
)
print("Answer:", response.content[0].text)

Performance

Latency: ~1.5 seconds for a typical image QA request on gpt-4o
Cost: ~$0.015 per 1,000 tokens plus image processing fees (check current pricing)
Rate limits: Tier 1: 300 requests per minute, 20,000 tokens per minute
  • Keep questions concise to reduce token usage.
  • Avoid sending very large images; resize before sending.
  • Cache frequent image queries to avoid repeated calls.
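The caching tip can be sketched with a hypothetical in-memory cache keyed by image digest and question; cached_answer and ask_fn are illustrative names, not part of any SDK:

```python
import hashlib

# Hypothetical in-memory cache: (image digest, question) -> answer
_cache = {}


def cached_answer(image_bytes, question, ask_fn):
    """Return a cached answer if available; otherwise call ask_fn and store it."""
    key = (hashlib.sha256(image_bytes).hexdigest(), question)
    if key not in _cache:
        _cache[key] = ask_fn(image_bytes, question)
    return _cache[key]
```

In a real service you would likely back this with Redis or a database and add TTL-based expiry, but the digest-plus-question key is the core idea.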
Approach | Latency | Cost/call | Best for
Standard call (gpt-4o) | ~1.5s | ~$0.015 | General image QA
Streaming response | Starts in ~0.5s | ~$0.015 | Interactive apps with long answers
Async call | ~1.5s | ~$0.015 | Concurrent or web server integration
Claude multimodal | ~1.7s | Check Anthropic pricing | Alternative style or compliance needs

Quick tip

Always base64-encode images and wrap them in a typed content part ('image_url' with a data URL for OpenAI, 'image' with a base64 source for Anthropic); a bare string or raw bytes will not be interpreted as an image.

Common mistake

Beginners often pass raw bytes or omit the typed content-part wrapper, which causes the request to fail validation or the model to ignore the image input entirely.

Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022