How-to · Intermediate · 3 min read

How to use vision models with vLLM

Quick answer
Use vLLM to serve a vision-capable model locally, then query it through its OpenAI-compatible API with images passed as base64 data URLs (or plain image URLs) inside image_url content parts. The vllm serve command starts the server, and Python clients send chat requests that mix text and image parts for multimodal inference.

PREREQUISITES

  • Python 3.9+
  • vLLM installed (pip install vllm)
  • A vision-capable model checkpoint supported by vLLM (e.g., Qwen/Qwen2-VL-7B-Instruct or meta-llama/Llama-3.2-11B-Vision-Instruct)
  • The OpenAI Python SDK (pip install "openai>=1.0")
  • No real OpenAI API key is required: the local vLLM server accepts any placeholder key unless it was started with --api-key

Setup

Install the vllm and openai Python packages. Download or prepare a vision-capable model checkpoint supported by vLLM. No OpenAI account is needed: the local vLLM server speaks the OpenAI API, and any placeholder key works unless the server is started with --api-key.

bash
pip install vllm openai
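
A quick way to confirm the installation before starting the server, using only the standard library:

```python
import importlib.metadata

# Report the installed version of each required package
for pkg in ("vllm", "openai"):
    try:
        print(f"{pkg} {importlib.metadata.version(pkg)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg} is not installed")
```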

Step by step

Start the vLLM server with a vision-enabled model, then send an image encoded as a base64 data URL via the OpenAI Python SDK to get a multimodal completion.

python
import base64
from openai import OpenAI

# Start the vLLM server in a separate terminal first (replace the model as needed):
# vllm serve Qwen/Qwen2-VL-7B-Instruct --port 8000

# Point the OpenAI client at the local vLLM server. No real API key is
# needed; any placeholder works unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Load and encode the image as base64
with open("example_image.png", "rb") as img_file:
    img_b64 = base64.b64encode(img_file.read()).decode("utf-8")

# Embed the image as a data URL in an image_url content part
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Query the server; the model name must match the served model
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=messages,
    temperature=0.0,
)

print(response.choices[0].message.content)
output
A detailed description of the image content printed here.
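
Images other than PNG need the matching MIME type in the data URL. A small helper for this using only the standard library (encode_image_as_data_url is an illustrative name, not part of vLLM or the OpenAI SDK):

```python
import base64
import mimetypes

def encode_image_as_data_url(path: str) -> str:
    """Base64-encode an image file and wrap it in a data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

The result can be passed directly as the image_url value in a message, e.g. {"url": encode_image_as_data_url("photo.jpg")}.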

Common variations

  • Use a different vision-capable model by changing the vllm serve model argument (and the model field in requests to match).
  • Pass a plain http(s) image URL in the image_url field instead of a base64 data URL; the server fetches the image itself, provided it can reach the address.
  • Use the SDK's async client (AsyncOpenAI) for concurrent requests.
python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # AsyncOpenAI is the SDK's async client; the sync client has no acreate()
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,BASE64_IMAGE_DATA"}},
                {"type": "text", "text": "What is in this picture?"},
            ],
        }
    ]
    response = await client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=messages,
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
output
Async response with image description.
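
Depending on the model and the server's multimodal limits (vLLM's --limit-mm-per-prompt flag controls how many images one prompt may carry), a single request can include several images as consecutive image_url parts. A sketch of a message builder (build_vision_message is an illustrative helper name, not part of vLLM or the OpenAI SDK):

```python
def build_vision_message(question: str, data_urls: list[str]) -> dict:
    """Assemble one user chat message: several image parts followed by the text prompt."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in data_urls]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

message = build_vision_message(
    "Compare these two screenshots.",
    ["data:image/png;base64,AAAA", "data:image/png;base64,BBBB"],
)
```

The returned dict drops straight into the messages list of chat.completions.create.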

Troubleshooting

  • If the server does not start, verify the model checkpoint path and that the model is a vision architecture vLLM supports.
  • If image input is not recognized, ensure the base64 payload is a valid data URL (data:image/png;base64,...) inside an image_url content part; raw base64 pasted into the text field is not parsed as an image.
  • Check that the OpenAI client's base_url points at the local vLLM server (http://localhost:8000/v1) and that api_key matches the server's --api-key if one was set.
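
If an image is rejected, a client-side round trip can rule out a malformed base64 payload before blaming the server:

```python
import base64

def is_valid_base64(payload: str) -> bool:
    """Return True if the string decodes as strict base64."""
    try:
        base64.b64decode(payload, validate=True)
        return True
    except Exception:
        return False

print(is_valid_base64("iVBORw0KGgo="))  # well-formed payload -> True
print(is_valid_base64("not base64!!"))  # invalid characters -> False
```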

Key Takeaways

  • Run vision-enabled models locally with the vllm serve CLI command.
  • Send images as base64 data URLs inside image_url content parts for multimodal input.
  • Use the OpenAI Python SDK with base_url pointed at the vLLM server for seamless API compatibility.
  • The SDK's async and streaming calls work against vLLM for efficient vision model querying.
  • Verify model compatibility and correct client configuration to avoid common errors.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, gpt-4o