How-to · Intermediate · 3 min read

How to use vision models with vLLM

Quick answer
Use vLLM to serve a vision-capable model locally, then query it through its OpenAI-compatible API with images passed as base64 data URLs (or plain image URLs) inside image_url content parts. The vllm serve command starts the server, and Python clients send chat requests that mix text and image parts for multimodal inference.

PREREQUISITES

  • Python 3.9+
  • vLLM installed (pip install vllm)
  • A vision-capable model checkpoint supported by vLLM (e.g., Qwen/Qwen2-VL-7B-Instruct or meta-llama/Llama-3.2-11B-Vision-Instruct)
  • The OpenAI Python SDK (pip install "openai>=1.0")
  • No real OpenAI API key is required: the local vLLM server accepts any placeholder key unless it was started with --api-key

Setup

Install the vllm and openai Python packages. Download or prepare a vision-capable model checkpoint supported by vLLM. No OpenAI account is needed: the local vLLM server speaks the OpenAI API, and any placeholder key works unless the server is started with --api-key.

bash
pip install vllm openai
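
A quick way to confirm the installation before starting the server, using only the standard library:

```python
import importlib.metadata

# Report the installed version of each required package
for pkg in ("vllm", "openai"):
    try:
        print(f"{pkg} {importlib.metadata.version(pkg)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg} is not installed")
```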

Step by step

Start the vLLM server with a vision-enabled model, then send an image encoded as a base64 data URL via the OpenAI Python SDK to get a multimodal completion.

python
import base64
from openai import OpenAI

# Start the vLLM server in a separate terminal first (replace the model as needed):
# vllm serve Qwen/Qwen2-VL-7B-Instruct --port 8000

# Point the OpenAI client at the local vLLM server. No real API key is
# needed; any placeholder works unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Load and encode the image as base64
with open("example_image.png", "rb") as img_file:
    img_b64 = base64.b64encode(img_file.read()).decode("utf-8")

# Embed the image as a data URL in an image_url content part
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Query the server; the model name must match the served model
response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=messages,
    temperature=0.0,
)

print(response.choices[0].message.content)
output
A detailed description of the image content printed here.
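
Images other than PNG need the matching MIME type in the data URL. A small helper for this using only the standard library (encode_image_as_data_url is an illustrative name, not part of vLLM or the OpenAI SDK):

```python
import base64
import mimetypes

def encode_image_as_data_url(path: str) -> str:
    """Base64-encode an image file and wrap it in a data URL with a guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        mime = "application/octet-stream"  # fallback when the extension is unknown
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{b64}"
```

The result can be passed directly as the image_url value in a message, e.g. {"url": encode_image_as_data_url("photo.jpg")}.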

Common variations

  • Use a different vision-capable model by changing the vllm serve model argument (and the model field in requests to match).
  • Pass a plain http(s) image URL in the image_url field instead of a base64 data URL; the server fetches the image itself, provided it can reach the address.
  • Use the SDK's async client (AsyncOpenAI) for concurrent requests.
python
import asyncio
from openai import AsyncOpenAI

async def async_query():
    # AsyncOpenAI is the SDK's async client; the sync client has no acreate()
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,BASE64_IMAGE_DATA"}},
                {"type": "text", "text": "What is in this picture?"},
            ],
        }
    ]
    response = await client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=messages,
    )
    print(response.choices[0].message.content)

asyncio.run(async_query())
output
Async response with image description.
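
Depending on the model and the server's multimodal limits (vLLM's --limit-mm-per-prompt flag controls how many images one prompt may carry), a single request can include several images as consecutive image_url parts. A sketch of a message builder (build_vision_message is an illustrative helper name, not part of vLLM or the OpenAI SDK):

```python
def build_vision_message(question: str, data_urls: list[str]) -> dict:
    """Assemble one user chat message: several image parts followed by the text prompt."""
    content = [{"type": "image_url", "image_url": {"url": u}} for u in data_urls]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

message = build_vision_message(
    "Compare these two screenshots.",
    ["data:image/png;base64,AAAA", "data:image/png;base64,BBBB"],
)
```

The returned dict drops straight into the messages list of chat.completions.create.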

Troubleshooting

  • If the server does not start, verify the model checkpoint path and that the model is a vision architecture vLLM supports.
  • If image input is not recognized, ensure the base64 payload is a valid data URL (data:image/png;base64,...) inside an image_url content part; raw base64 pasted into the text field is not parsed as an image.
  • Check that the OpenAI client's base_url points at the local vLLM server (http://localhost:8000/v1) and that api_key matches the server's --api-key if one was set.
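
If an image is rejected, a client-side round trip can rule out a malformed base64 payload before blaming the server:

```python
import base64

def is_valid_base64(payload: str) -> bool:
    """Return True if the string decodes as strict base64."""
    try:
        base64.b64decode(payload, validate=True)
        return True
    except Exception:
        return False

print(is_valid_base64("iVBORw0KGgo="))  # well-formed payload -> True
print(is_valid_base64("not base64!!"))  # invalid characters -> False
```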

Key Takeaways

  • Run vision-enabled models locally with the vllm serve CLI command.
  • Send images as base64 data URLs inside image_url content parts for multimodal input.
  • Use the OpenAI Python SDK with base_url pointed at the vLLM server for seamless API compatibility.
  • The SDK's async and streaming calls work against vLLM for efficient vision model querying.
  • Verify model compatibility and correct client configuration to avoid common errors.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct, gpt-4o