How-to · Intermediate · 3 min read

How to use Phi-3 vision

Quick answer
Use the Phi-3 vision model through an OpenAI-compatible chat completions endpoint by sending image inputs along with text prompts in a single request. The model processes both visual and textual input, enabling multimodal understanding and generation. (Phi-3 vision is a Microsoft model; the OpenAI Python SDK works with it when the model is served behind an OpenAI-compatible API.)

PREREQUISITES

  • Python 3.8+
  • An API key for an endpoint that serves Phi-3 vision (the examples read it from OPENAI_API_KEY)
  • pip install "openai>=1.0" (quoted so the shell does not treat >= as a redirect)

Setup

Install the official openai Python package and set your API key as an environment variable.

  • Run pip install openai to install the SDK.
  • Set your API key in your shell: export OPENAI_API_KEY='your_api_key' (Linux/macOS) or setx OPENAI_API_KEY "your_api_key" (Windows; setx takes effect in new terminal sessions).
bash
pip install openai
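Before making any calls, you can confirm that Python actually sees the key. This small check is illustrative, not part of the SDK:

```python
import os

def api_key_status() -> str:
    """Report whether OPENAI_API_KEY is visible, without printing the key."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        return "OPENAI_API_KEY is not set - see the export/setx commands above."
    return f"Key found ({key[:3]}..., {len(key)} chars)"

print(api_key_status())
```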

Step by step

Send an image and a text prompt to the Phi-3 vision model using the OpenAI Python SDK. Images go into the content array of a user message as image_url parts; a local file is base64-encoded into a data URL.

python
import base64
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Encode the local image as a base64 data URL
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }
]

response = client.chat.completions.create(
    model="phi-3-vision",
    messages=messages,
)

print(response.choices[0].message.content)
output
A scenic mountain landscape with a clear blue sky and a lake reflecting the mountains.
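If the image is already hosted somewhere, many OpenAI-compatible servers also accept a plain HTTPS URL in place of the base64 data URL. Whether remote URLs are fetched depends on the serving stack, so treat this as a sketch:

```python
# Same request shape as above, but with a remote URL instead of a data URL.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

# messages plugs into client.chat.completions.create(...) exactly as before.
```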

Common variations

Use streaming to receive partial output as the model generates it, or the async client (AsyncOpenAI) for concurrency. You can also combine text and multiple images in one request for richer multimodal tasks.

python
import asyncio
import base64
import os
from openai import AsyncOpenAI

async def main():
    # The async client is required for "await" and "async for" below
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

    with open("example.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What objects are in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ]

    stream = await client.chat.completions.create(
        model="phi-3-vision",
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(main())
output
Objects detected: mountain, lake, sky, trees.
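For the multi-image case mentioned above, put several image_url parts in one user message. A minimal helper sketch (the function name is illustrative, not part of the SDK):

```python
import base64

def build_vision_message(prompt, image_bytes_list, mime="image/jpeg"):
    """Build one user message combining a text prompt and several images."""
    content = [{"type": "text", "text": prompt}]
    for raw in image_bytes_list:
        b64 = base64.b64encode(raw).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"},
        })
    return {"role": "user", "content": content}

# One prompt plus two images -> a single message with three content parts
msg = build_vision_message("Compare these two photos.",
                           [b"\xff\xd8fake-jpeg-1", b"\xff\xd8fake-jpeg-2"])
print(len(msg["content"]))  # 3
```

The returned dict drops straight into the messages list used by the earlier examples.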

Troubleshooting

  • If you get an authentication error, verify your OPENAI_API_KEY environment variable is set correctly.
  • If the image is not recognized, ensure it is a supported format (JPEG, PNG) and not too large (under 4MB recommended).
  • For unexpected errors, check that the model name matches what your endpoint actually serves (client.models.list() shows the available names) and that your openai package version is compatible.
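The format and size checks above can be automated before upload. A small illustrative pre-check (the 4 MB limit follows the guideline in this list, not a documented hard limit):

```python
import os

MAX_BYTES = 4 * 1024 * 1024          # ~4 MB, per the guideline above
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def validate_image(path):
    """Raise ValueError if the file is an unsupported format or too large."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext or 'none'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("image exceeds 4 MB; resize or compress it first")
```

Call validate_image("example.jpg") before base64-encoding the file, so a bad input fails locally with a clear message instead of as an opaque API error.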

Key Takeaways

  • Use the phi-3-vision model with the OpenAI Python SDK, pointed at an OpenAI-compatible endpoint, for multimodal image and text tasks.
  • Send images as base64 data URLs in image_url content parts alongside text prompts.
  • Streaming and the async client enable efficient handling of large or multiple multimodal requests.
Verified 2026-04 · phi-3-vision