Tool Intermediate medium · 6 min cli_command

Image input format

What you will learn

Format and pass images to vLLM's vision models using URLs, local file paths, or base64-encoded data.

Why this matters

Vision models in vLLM require images in a specific structured format. Passing images incorrectly causes silent failures, malformed requests, or 'unsupported image format' errors that waste inference time and debugging cycles.

Skip if: Your inference pipeline only handles text inputs. Text-only models (Llama 2, Mistral 7B) cannot accept images at all, so requests that include them are rejected.

Explanation

vLLM's vision models (LLaVA, Qwen-VL, LLaVA-NeXT, etc.) accept images through the OpenAI-compatible chat completions API: each image is a content part with type image_url, placed next to the text parts of a message. The url value can be an http(s) URL (downloaded by vLLM when the request arrives), a base64 data: URL (for data encoded by the client), or a file:// URL (resolved on the server). The server only resolves file:// URLs when it is started with --allowed-local-media-path pointing at a directory that contains the file.

The format you choose depends on your deployment: URLs work well for publicly accessible images, base64 suits client-side encoding before transmission, and file paths avoid network transfer when images are already on the inference server. Common formats such as JPEG, PNG, and WebP are decoded automatically, and the model's processor handles resizing and padding.

The model's image encoder turns each image into tokens that share the context window with the text tokens. The token count per image varies by model: LLaVA-1.5 uses 576 tokens per image at 336×336 px, and other models differ. Requests with several images need token budgeting to avoid overflowing the context window.
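
The token budgeting described above is simple arithmetic. The sketch below is illustrative only: the 576 tokens per image figure comes from the text, while the 4096-token context window and the function names are assumptions to adjust for your model and server settings.

```bash
#!/bin/sh
# Rough context budgeting for multi-image prompts.
# Assumptions: 576 tokens per image (LLaVA-1.5 at 336x336) and a 4096-token
# context window. Check both values for your model and server configuration.
TOKENS_PER_IMAGE=576
CONTEXT_WINDOW=4096

# remaining_tokens <num_images> <text_prompt_tokens>
# Prints how many tokens are left for the response.
remaining_tokens() {
  echo $(( CONTEXT_WINDOW - $1 * TOKENS_PER_IMAGE - $2 ))
}

# max_images <text_prompt_tokens> <max_tokens>
# Prints how many images fit once the text prompt and the response are reserved.
max_images() {
  echo $(( (CONTEXT_WINDOW - $1 - $2) / TOKENS_PER_IMAGE ))
}

remaining_tokens 2 50    # two images plus a ~50-token prompt: prints 2894
max_images 50 256        # ~50-token prompt, max_tokens of 256: prints 6
```

If remaining_tokens comes out smaller than the max_tokens you plan to request, drop an image or shorten the prompt before sending.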

Configuration

bash

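#!/bin/bash

# Start the vLLM server with a vision model. `vllm serve` runs in the foreground,
# so start it in a separate terminal before sending the requests below.
# --allowed-local-media-path is required for the file:// example further down.
# The two-image example may also need --limit-mm-per-prompt, because some releases
# default to one image per prompt; the value syntax differs between versions
# (image=2 in older releases, '{"image": 2}' in newer ones).
vllm serve llava-hf/llava-1.5-7b-hf \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --allowed-local-media-path /data/documents

# Client: Image via URL (the server downloads it; jq extracts the reply text)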
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image? Be concise."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Albedo_-_the_rabbit_constellation.jpg/640px-Albedo_-_the_rabbit_constellation.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 128,
    "temperature": 0.7
  }' | jq '.choices[0].message.content'

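# Client: Image via base64 (local file encoded before transmission)
IMAGE_PATH="/tmp/test_image.jpg"
# -w 0 (GNU coreutils) disables line wrapping; on macOS use: base64 -i "$IMAGE_PATH"
# Large images can exceed the shell's argument length limit; in that case write
# the JSON to a file and send it with: curl -d @payload.json
BASE64_IMAGE=$(base64 -w 0 < "$IMAGE_PATH")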

curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\": \"llava-hf/llava-1.5-7b-hf\",
    \"messages\": [
      {
        \"role\": \"user\",
        \"content\": [
          {
            \"type\": \"text\",
            \"text\": \"What objects do you see?\"
          },
          {
            \"type\": \"image_url\",
            \"image_url\": {
              \"url\": \"data:image/jpeg;base64,$BASE64_IMAGE\"
            }
          }
        ]
      }
    ],
    \"max_tokens\": 256
  }" | jq '.choices[0].message.content'

# Client: Image via local file path (resolved on the server; the path must sit
# under the directory passed to --allowed-local-media-path)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Analyze this document image."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "file:///data/documents/invoice_2024.png"
            }
          }
        ]
      }
    ],
    "max_tokens": 512
  }' | jq '.choices[0].message.content'

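# Client: Multiple images in a single request (token-aware).
# Each image adds ~576 prompt tokens for LLaVA-1.5, and the server must allow
# more than one image per prompt (see --limit-mm-per-prompt).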
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Compare these two images. How are they different?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image1.jpg"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image2.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 256
  }' | jq '.choices[0].message.content'

Why this order?

Start the server first with the vision model. URL-based images are simplest (the server fetches them), so they come first. Base64 follows because it requires client-side encoding. File paths come next because they require images to exist on the server. The multi-image request is last because every image adds to the prompt's token count, so it needs token accounting to stay within the context window.
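
The base64 step in the configuration above can be wrapped in a small helper. This is a minimal sketch, not part of vLLM: the function name image_data_url is hypothetical, and guessing the MIME type from the file extension is an assumption made for brevity.

```bash
#!/bin/sh
# image_data_url <path>: print a data: URL for a local image file.
# The MIME type is guessed from the file extension (an assumption for brevity).
image_data_url() {
  case "$1" in
    *.jpg|*.jpeg|*.JPG|*.JPEG) mime="image/jpeg" ;;
    *.png|*.PNG)               mime="image/png" ;;
    *.webp|*.WEBP)             mime="image/webp" ;;
    *) echo "unsupported extension: $1" >&2; return 1 ;;
  esac
  [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
  # Stripping newlines with tr works with both GNU and BSD base64,
  # unlike the GNU-only -w 0 flag.
  printf 'data:%s;base64,%s' "$mime" "$(base64 < "$1" | tr -d '\n')"
}
```

The output goes straight into the url field of an image_url content part, for example by assigning it to a variable the same way BASE64_IMAGE is used above.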

Wrong vs Right

Wrong way

bash

curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": "What is in this image? /tmp/photo.jpg"
      }
    ],
    "max_tokens": 128
  }'

# Or the same mistake with different wording. Both requests are accepted,
# but the model only sees the path as text and never receives the image:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": "Analyze image: /tmp/photo.jpg"
      }
    ]
  }'

Right way

bash

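# file:// URLs only work if the server was started with
# --allowed-local-media-path /tmp (or a parent directory).
curl -X POST http://localhost:8000/v1/chat/completions \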
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "file:///tmp/photo.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 128
  }'

Tool vitals

Primary command

bash

vllm serve <model> --tensor-parallel-size <n>

Config file None: configured via API request JSON

Verify

bash

curl -X POST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"llava-hf/llava-1.5-7b-hf","messages":[{"role":"user","content":[{"type":"text","text":"Describe this image"},{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/a/a7/Camponotus_flavomarginatus_ant.jpg"}}]}],"max_tokens":256}'

Integration notes

vLLM's vision model server is typically deployed alongside text-only LLM inference. Route vision requests to the vision model (llava-1.5-7b-hf) and text-only requests to faster text-only models (Llama 3.2-1B). Use an inference gateway (e.g., Kong, Caddy, or a custom FastAPI wrapper) to route requests by model type. For production pipelines using Ray, integrate vLLM via Ray Serve; the image format stays identical and only the client code (Ray actor) changes.

Migration path

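If moving away from vLLM: OpenAI's vision-capable chat completions API uses the same JSON structure (content parts with type: 'image_url'), so http(s) URLs and base64 data URLs carry over unchanged. file:// URLs do not, because they rely on the vLLM server's local filesystem. Other open-source servers may use different formats; Ollama's native API, for example, takes base64 images in a separate images field, so you will need to rewrite the message structure. For local-only inference, vLLM's offline Python API (the LLM class) avoids HTTP serialization: LLM.chat accepts the same message structure, while LLM.generate takes images through the multi_modal_data field of the prompt.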

Cost model

vLLM is open-source and free. Costs come from GPU inference hours (if cloud-hosted) and data transfer (downloading remote images from URLs). Base64-encoded images enlarge the request body by about a third but add no cost beyond that transfer. File paths are cheapest (local disk I/O only).

Common gotcha

vLLM downloads remote images at request time, so a URL that is slow or unavailable makes the request fail or time out even though the prompt itself is valid. Validate URL accessibility before sending production traffic. Also, base64-encoded images make request bodies large, not headers: a reverse proxy or gateway in front of vLLM may enforce a body size limit (nginx's client_max_body_size defaults to 1 MB), so raise that limit or pass large images via file:// paths instead. Finally, image tokens count toward the prompt, not toward max_tokens: budget ~576 tokens per image for LLaVA-1.5 and make sure the prompt plus max_tokens fits within the model's context length, or the request is rejected or the response is cut short.
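
Both checks can run on the client before a request is sent. The sketch below is illustrative: the function names and the BODY_LIMIT_BYTES setting are hypothetical, and the 1 MiB default is only a placeholder for whatever limit your proxy actually enforces.

```bash
#!/bin/sh
# Pre-flight checks for image requests.
# BODY_LIMIT_BYTES is a hypothetical setting; use the limit your proxy enforces.
BODY_LIMIT_BYTES=1048576

# base64_length <raw_bytes>: size of the base64 encoding
# (4 output bytes for every 3 input bytes, rounded up for padding).
base64_length() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

# fits_body_limit <image_path>: succeed if the encoded image stays under the limit.
# The JSON around the image adds a little more, so leave some headroom.
fits_body_limit() {
  raw=$(( $(wc -c < "$1") ))
  [ "$(base64_length "$raw")" -lt "$BODY_LIMIT_BYTES" ]
}

# url_reachable <url>: succeed if a HEAD request completes without an HTTP
# error status (400 or above) within 5 seconds.
url_reachable() {
  curl --silent --fail --head --max-time 5 --output /dev/null "$1"
}
```

Some servers reject HEAD requests, so a failed url_reachable check is a hint to investigate rather than proof that the image cannot be fetched.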

Team adoption

Standardize on one format per team deployment. All three use the same image_url structure and differ only in the url value: (1) If all images are public, pass http(s) URLs directly. (2) If processing sensitive company images, pre-encode them as base64 on the client and transmit them as data: URLs (no intermediate storage). (3) If images live on the inference server, use file:// URLs and document the shared mount point. Create a simple wrapper function (curl CLI or Python requests) that builds the image_url JSON structure, so developers cannot accidentally pass raw file paths as text.
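
One possible shape for such a wrapper, as a sketch rather than a finished tool: image_ref_to_url and vision_payload are hypothetical names, and the payload builder assumes jq is installed (the examples above already use it).

```bash
#!/bin/sh
# image_ref_to_url <ref>: turn an image reference into a value for image_url.url.
# http(s)://, data:image/ and file:/// references pass through unchanged,
# an absolute path becomes a file:// URL, and anything else is rejected.
image_ref_to_url() {
  case "$1" in
    http://*|https://*|data:image/*|file:///*) printf '%s\n' "$1" ;;
    /*) printf 'file://%s\n' "$1" ;;
    *) echo "not a URL or absolute path: $1" >&2; return 1 ;;
  esac
}

# vision_payload <model> <prompt> <image_ref> [max_tokens]
# Prints a chat completions request body. jq takes care of JSON escaping.
vision_payload() {
  url=$(image_ref_to_url "$3") || return 1
  jq -n --arg model "$1" --arg prompt "$2" --arg url "$url" \
        --argjson max_tokens "${4:-128}" \
    '{model: $model, max_tokens: $max_tokens,
      messages: [{role: "user", content: [
        {type: "text", text: $prompt},
        {type: "image_url", image_url: {url: $url}}]}]}'
}
```

The output can be piped to curl with -d @- so the body is read from standard input. Paths containing spaces or other special characters would need percent-encoding, which this sketch does not do.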

Experienced dev note

Use file:// URLs for images already on the inference server: they skip the network download and the base64 encode/decode round trip. The size of the gain depends on image size and network conditions, so measure it on your own workload. For batch processing, send requests concurrently rather than strictly one after another so the server can batch them; the model is loaded once at server startup, so there is no per-request loading cost to amortize. The --image-input-type flag from early vLLM releases has since been removed; check vllm serve --help for the multimodal options your version supports.

Check your understanding

You have a batch of 100 product images stored on your inference server at /data/images/*.jpg. Your client code runs on a separate machine. Should you use URLs, base64, or file paths? Why?

Answer hint

Use file:// URLs. The server resolves the path, so the client only sends the path string (for example file:///data/images/<name>.jpg) and never needs to read the file itself. The server must allow that directory via --allowed-local-media-path. Base64 would require copying the images to the client first, and http(s) URLs are only viable if the images are already hosted somewhere the server can reach.
