Image input format
Why this matters
Vision models in vLLM require images in a specific structured format. Passing images incorrectly causes silent failures, malformed requests, or 'unsupported image format' errors that waste inference time and debugging cycles.
Explanation
vLLM's vision models (LLaVA, Qwen-VL, LLaVA-NeXT, etc.) accept images through the OpenAI-compatible chat API. Each image is an image_url content part inside a message's content array, not a separate image_data field. The url value can be an http(s) URL (downloaded by the server before the prompt is processed), a base64 data URI (data:image/jpeg;base64,...), or a file:// URL pointing at a path on the server. file:// URLs are rejected unless the server was started with --allowed-local-media-path covering that directory.
The format you choose depends on your deployment: http(s) URLs work well for images the server can reach, base64 suits clients that hold the image bytes and cannot share a URL, and file:// URLs avoid network transfer when images are already on the inference server.
The server decodes common formats such as JPEG, PNG, and WebP, and the model's own image processor handles resizing and padding. The image encoder turns each image into tokens that are merged with the text tokens in the context window. Image token count varies by model: LLaVA-1.5 uses 576 tokens per image (a 336×336 px input split into 14×14 px patches), and other models differ. These tokens count toward the prompt, so a request with several images needs token budgeting to stay within the model's context length.
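The token budgeting described above is simple arithmetic, and a small shell helper makes it concrete. This is a minimal sketch, not a vLLM command: fits_context is a hypothetical name, and the 576 tokens per image and 4096-token context are assumptions that match LLaVA-1.5 only. The text token count must come from your own tokenizer estimate.

```shell
#!/bin/bash
# Sketch: check that prompt text + image tokens + max_tokens fit the context window.
# Assumptions: 576 tokens per image and a 4096-token context (LLaVA-1.5).
# fits_context is a hypothetical helper, not part of vLLM.
TOKENS_PER_IMAGE=576
MAX_MODEL_LEN=4096

# fits_context TEXT_TOKENS NUM_IMAGES MAX_TOKENS
# Prints the total token requirement; succeeds only if it fits the context.
fits_context() {
  local text_tokens=$1 num_images=$2 max_tokens=$3
  local total=$(( text_tokens + num_images * TOKENS_PER_IMAGE + max_tokens ))
  echo "$total"
  [ "$total" -le "$MAX_MODEL_LEN" ]
}
```

For example, fits_context 50 2 256 prints 1458 and succeeds, while seven images with the same text and max_tokens would need 4338 tokens and fail the check.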
Configuration
#!/bin/bash
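# Start the vLLM server with a vision model (run in its own terminal; the command blocks).
# --allowed-local-media-path is needed only for the file:// example further down.
vllm serve llava-hf/llava-1.5-7b-hf \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.95 \
--allowed-local-media-path /data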
# Client: image via URL (curl sends the request, jq extracts the reply)
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image? Be concise."
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Albedo_-_the_rabbit_constellation.jpg/640px-Albedo_-_the_rabbit_constellation.jpg"
}
}
]
}
],
"max_tokens": 128,
"temperature": 0.7
}' | jq '.choices[0].message.content'
# Client: Image via base64 (local file encoded before transmission)
IMAGE_PATH="/tmp/test_image.jpg"
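# -w 0 disables line wrapping (GNU coreutils). If your base64 lacks -w,
# use: base64 < "$IMAGE_PATH" | tr -d '\n'
BASE64_IMAGE=$(base64 -w 0 < "$IMAGE_PATH")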
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d "{
\"model\": \"llava-hf/llava-1.5-7b-hf\",
\"messages\": [
{
\"role\": \"user\",
\"content\": [
{
\"type\": \"text\",
\"text\": \"What objects do you see?\"
},
{
\"type\": \"image_url\",
\"image_url\": {
\"url\": \"data:image/jpeg;base64,$BASE64_IMAGE\"
}
}
]
}
],
\"max_tokens\": 256
}" | jq '.choices[0].message.content'
# Client: Image via local file path (server-side resolution)
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Analyze this document image."
},
{
"type": "image_url",
"image_url": {
"url": "file:///data/documents/invoice_2024.png"
}
}
]
}
],
"max_tokens": 512
}' | jq '.choices[0].message.content'
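# Client: multiple images in a single request (token-aware)
# Some vLLM releases allow only one image per prompt by default;
# raise the limit with --limit-mm-per-prompt when starting the server.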
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images. How are they different?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image1.jpg"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image2.jpg"
}
}
]
}
],
"max_tokens": 256
}' | jq '.choices[0].message.content'
Why this order?
Start the server first with the vision model. URL-based images are simplest (the server fetches them), so they come first. Base64 follows because it requires client-side encoding. File paths come next because they require the images to exist on the server. The multi-image request comes last because each image adds to the prompt (576 tokens each for LLaVA-1.5), so it needs token accounting to stay within the context length. The risk there is a rejected or truncated request, not a GPU out-of-memory error.
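The client-side encoding step for base64 can be wrapped in a small function so the MIME type and the data: prefix are never typed by hand. This is a sketch under assumptions: to_data_uri is a hypothetical helper name, and the MIME type is guessed from the file extension rather than from the file contents.

```shell
#!/bin/bash
# Hypothetical helper: turn a local image file into a data URI
# suitable for the "url" field of an image_url content part.
to_data_uri() {
  local path=$1 mime
  # Guess the MIME type from the extension (an assumption; contents are not inspected).
  case "${path##*.}" in
    jpg|jpeg|JPG|JPEG) mime="image/jpeg" ;;
    png|PNG)           mime="image/png" ;;
    webp|WEBP)         mime="image/webp" ;;
    *) echo "unsupported extension: $path" >&2; return 1 ;;
  esac
  # tr strips the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$path" | tr -d '\n')"
}
```

A single command-line argument is typically limited to 128 KiB on Linux, so for larger images it is safer to write the JSON payload to a file or pipe it to curl with -d @- than to inline the data URI in the command.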
Wrong vs Right
# Wrong: the image path is plain text, so the model never receives an image
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": "What is in this image? /tmp/photo.jpg"
}
],
"max_tokens": 128
}'
# Also wrong: a raw file path in the text, with no image_url content part
curl -X POST http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": "Analyze image: /tmp/photo.jpg"
}
]
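}'
# Right: send the image as a structured image_url content part
# (file:// URLs need --allowed-local-media-path covering /tmp on the server)
curl -X POST http://localhost:8000/v1/chat/completions \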
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "file:///tmp/photo.jpg"
}
}
]
}
],
"max_tokens": 128
}'
Tool vitals
vllm serve <model> --tensor-parallel-size <n>
None: configured via API request JSON
curl -X POST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"llava-hf/llava-1.5-7b-hf","messages":[{"role":"user","content":[{"type":"text","text":"Describe this image"},{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"}}]}],"max_tokens":256}'
Integration notes
A vLLM vision server is typically deployed alongside text-only LLM inference. Route vision requests to the vision model (llava-1.5-7b-hf) and text-only requests to a faster text-only model (for example Llama 3.2 1B). Use an inference gateway (e.g., Kong, Caddy, or a custom FastAPI wrapper) to route requests by model type. For production pipelines using Ray, integrate vLLM via Ray Serve. The image format remains identical; only the client code (Ray actor) changes.
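The routing decision itself can be sketched in a few lines of shell. This is an illustration of the idea, not a gateway: pick_backend and both backend URLs are hypothetical, and the check is a plain text match for an image_url content part rather than real JSON parsing, which a production gateway should do instead.

```shell
#!/bin/bash
# Hypothetical upstreams; replace with your own hosts.
VISION_BACKEND="http://vision-host:8000"
TEXT_BACKEND="http://text-host:8001"

# pick_backend: read a JSON request body on stdin and print the backend
# that should serve it. Requests with an image_url part go to the vision model.
pick_backend() {
  if grep -Eq '"type"[[:space:]]*:[[:space:]]*"image_url"'; then
    echo "$VISION_BACKEND"
  else
    echo "$TEXT_BACKEND"
  fi
}
```

Usage would look like pick_backend < request.json, with the printed URL used as the target for the forwarded request.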
Migration path
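If moving away from vLLM: OpenAI's Chat Completions API uses the same message structure (content parts with type: 'image_url'), so requests that use http(s) URLs or base64 data URIs carry over with little change. file:// URLs are a vLLM server feature and will not work there. Other open-source servers (Ollama, llama.cpp) may expect a different structure through their native APIs, so check their documentation and expect to rewrite the message structure. For local-only inference, vLLM's offline Python API (the LLM class) bypasses HTTP serialization. LLM.chat accepts the same message structure, while LLM.generate takes a prompt plus multi_modal_data containing decoded images (for example PIL images) instead of image_url parts.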
Cost model
vLLM is open-source and free. Costs come from GPU inference hours (if cloud-hosted) and data transfer (the server downloading remote images from URLs). Base64 encoding inflates each image by about a third, which adds to request bandwidth but not to GPU cost. File paths are cheapest (local disk I/O only).
Common gotcha
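vLLM downloads remote images when the request arrives, before the model processes the prompt. If a URL is unreachable or slow, the request fails with an error and no tokens are generated (the fetch timeout is governed by the VLLM_IMAGE_FETCH_TIMEOUT environment variable). Validate URL accessibility from the server's network before relying on it in production. Also, base64-encoded images travel in the request body, not in headers, so header size limits do not apply. Body size limits on a reverse proxy or gateway do apply (nginx's client_max_body_size defaults to 1 MB, for example), and base64 makes the image about a third larger. Raise the proxy limit, downscale the image, or pass large images via file:// paths instead. Finally, max_tokens caps only the generated tokens; image tokens are part of the prompt. Prompt tokens (text plus 576 per image for LLaVA-1.5) and max_tokens together must fit within the model's context length, otherwise the request is rejected or the output is cut short, depending on the vLLM version.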
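Two of these checks, URL reachability and encoded payload size, are easy to run before a request is sent. The helpers below are a sketch with hypothetical names, not part of vLLM. url_reachable assumes the image host answers HEAD requests (some do not), and it must run from a machine with the same network view as the server to be meaningful.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers; neither is part of vLLM.

# Succeeds if the URL answers a HEAD request within 5 seconds.
url_reachable() {
  curl --silent --fail --head --location --max-time 5 --output /dev/null "$1"
}

# Print the size in bytes of the base64 encoding of an N-byte file
# (4 output bytes for every 3 input bytes, rounded up).
base64_size() {
  echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

# fits_body_limit IMAGE_BYTES LIMIT_BYTES
# Succeeds if the encoded image alone fits under the body-size limit.
fits_body_limit() {
  [ "$(base64_size "$1")" -le "$2" ]
}
```

A 1 MiB image (1048576 bytes) encodes to 1398104 bytes, so it would not fit under a 1 MiB body limit even before the rest of the JSON is counted.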
Team adoption
Standardize on one format per team deployment. All three options use the same image_url content part; only the url value differs. (1) If all images are public, send http(s) URLs. (2) If processing sensitive company images, encode them as base64 data URIs client-side and transmit them directly (no intermediate storage). (3) If images live on the inference server, use file:// URLs and document the shared mount point. Create a simple wrapper function (curl CLI or Python requests) that builds the image_url JSON structure, so developers cannot accidentally pass raw file paths as text.
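One possible shape for such a wrapper in shell is shown below. It is a sketch under assumptions: build_image_payload and ask_image are hypothetical names, curl and jq must be installed, and the defaults assume the server and model used elsewhere in this guide. Building the JSON with jq avoids hand-escaped quotes and keeps prompts with special characters valid.

```shell
#!/bin/bash
# Hypothetical wrapper; assumes curl and jq are installed and a vLLM server
# is reachable at VLLM_URL. Both defaults can be overridden from the environment.
VLLM_URL="${VLLM_URL:-http://localhost:8000/v1/chat/completions}"
VLLM_MODEL="${VLLM_MODEL:-llava-hf/llava-1.5-7b-hf}"

# build_image_payload PROMPT IMAGE_URL [MAX_TOKENS] -> request JSON on stdout
# IMAGE_URL may be an http(s) URL, a data: URI, or a file:// URL.
build_image_payload() {
  jq -n --arg model "$VLLM_MODEL" --arg text "$1" --arg url "$2" \
        --argjson max_tokens "${3:-128}" '{
    model: $model,
    messages: [{
      role: "user",
      content: [
        {type: "text", text: $text},
        {type: "image_url", image_url: {url: $url}}
      ]
    }],
    max_tokens: $max_tokens
  }'
}

# ask_image PROMPT IMAGE_URL [MAX_TOKENS] -> model reply on stdout
# The payload is piped to curl on stdin, so large data URIs never hit
# command-line length limits.
ask_image() {
  build_image_payload "$@" |
    curl --silent --fail -X POST "$VLLM_URL" \
         -H 'Content-Type: application/json' --data-binary @- |
    jq -r '.choices[0].message.content'
}
```

A call such as ask_image "What is in this image?" "file:///tmp/photo.jpg" then replaces the hand-written curl commands, and the only thing a developer chooses is the url value.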
Experienced dev note
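Prefer file:// URLs for images already on the inference server: they skip the base64 encode, upload, and decode steps, and the server reads the file straight from disk. The gain depends on image size and network conditions, so measure it for your workload rather than assuming a fixed speedup. The server must be started with --allowed-local-media-path for these URLs to be accepted. For batch processing, send requests concurrently rather than one at a time: the model is loaded once at server start, and vLLM's continuous batching schedules concurrent requests together. Put several images in one request only when the prompt needs to reason over them jointly. The --image-input-type flag (pixel_values or image_features) existed only in early vLLM releases and has since been removed; current versions take images through the request as described here.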
Check your understanding
You have a batch of 100 product images stored on your inference server at /data/images/*.jpg. Your client code runs on a separate machine. Should you use URLs, base64, or file paths? Why?
Show answer hint
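File paths are fastest for server-local images. The client cannot read the files, but it does not need to: it only sends file:// URLs (for example file:///data/images/<name>.jpg) that the server resolves, provided the server allows that directory via --allowed-local-media-path. Base64 would require copying the images to the client first, and http(s) URLs are only viable if the images are already hosted somewhere the server can reach.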