High severity · Intermediate · Fix: 3-8 min

ValueError

ValueError: Images and text must be provided in the correct format for LLaVA processor

What this error means

LLaVA's Hugging Face processor raises a ValueError during preprocessing when the image and text inputs do not match the structure it expects: a list of dicts, each with 'type' and 'content' keys.

Stack trace

traceback

ValueError: Images and text must be provided in the correct format for LLaVA processor. Expected format: [{'type': 'image', 'content': <PIL.Image or tensor>}, {'type': 'text', 'content': 'text string'}]
  File "/venv/lib/python3.11/site-packages/transformers/models/llava/image_processing_llava.py", line 127, in preprocess
    raise ValueError("Images and text must be provided in the correct format for LLaVA processor. Expected format: [{'type': 'image', 'content': <image>}, {'type': 'text', 'content': 'text string'}]")
ValueError: Images and text must be provided in the correct format for LLaVA processor

QUICK FIX

Wrap your image and text in a list of dicts with 'type' and 'content' keys: `processor([{'type': 'image', 'content': image}, {'type': 'text', 'content': prompt}], ...)` instead of passing them separately.

Why it happens

LLaVA's processor.preprocess() expects its input as a list of dictionaries, each with a 'type' key ('image' or 'text') and a 'content' key. Developers often pass the image and text as separate positional arguments, or carry over call patterns from older vision models whose APIs take (image, text) tuples. The processor is strict about this format because it must interleave image and text tokens in the correct order for LLaVA's multimodal architecture.

Detection

Add print statements to inspect your input structure before passing to processor: `print(type(inputs), inputs if isinstance(inputs, list) else 'not a list')`. Use try/except ValueError to catch format errors early and log the raw input for debugging.
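The inspection and try/except steps above can be folded into two small helpers. This is a minimal sketch: `describe_inputs` and `call_with_logging` are hypothetical names, not part of transformers, and `process` stands for whatever callable you normally invoke (for example, your processor). The helpers only look at the list-of-dicts structure described on this page.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def describe_inputs(inputs: Any) -> str:
    """Summarize the structure of `inputs` without printing large contents.

    Hypothetical helper, not a transformers API.
    """
    if not isinstance(inputs, list):
        return f"not a list: {type(inputs).__name__}"
    parts = []
    for i, item in enumerate(inputs):
        if isinstance(item, dict):
            # Report the keys and declared type, but skip the content itself.
            keys = sorted(str(k) for k in item)
            parts.append(f"[{i}] dict keys={keys} type={item.get('type')!r}")
        else:
            parts.append(f"[{i}] {type(item).__name__}")
    return f"list of {len(inputs)}: " + "; ".join(parts)


def call_with_logging(process: Callable[..., Any], inputs: Any, **kwargs: Any) -> Any:
    """Call `process(inputs, **kwargs)` and log the input structure on ValueError."""
    try:
        return process(inputs, **kwargs)
    except ValueError:
        logger.error("Processor rejected inputs: %s", describe_inputs(inputs))
        raise  # Re-raise so the caller still sees the original error.
```

Calling `call_with_logging(processor, inputs, return_tensors="pt")` would then log a one-line summary of the rejected structure before the exception propagates.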

Causes & fixes

Passing image and text as separate positional arguments instead of a list of format dicts

✓ Fix

Restructure to: `inputs = processor([{'type': 'image', 'content': image}, {'type': 'text', 'content': prompt}], ...)` instead of `processor(image, text, ...)`

Using dict keys like 'image'/'text' or 'image_content'/'text_content' instead of 'type'/'content'

✓ Fix

Rename the keys in every input dict to exactly 'type' (value: 'image' or 'text') and 'content' (value: a PIL Image or a string)

Passing a PIL Image or tensor directly without wrapping in the format dict structure

✓ Fix

Always wrap: `[{'type': 'image', 'content': PIL.Image.open('file.jpg')}, {'type': 'text', 'content': prompt}]`

Mixing old vision model API patterns (like BLIP's processor) with LLaVA's strict format requirements

✓ Fix

Review the Hugging Face LLaVA documentation for your exact model version, and use the model card's example code as your template. LLaVA's format differs from BLIP, ViLBERT, and other vision models.
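The key-renaming fix above can be automated for legacy call sites. The sketch below is illustrative only: `normalize_entry` and the alias table are hypothetical, and the alias names are simply the alternative keys listed in the causes above. It converts one dict at a time into the 'type'/'content' shape this page describes.

```python
from typing import Any, Dict

# Alternative key names mentioned above, mapped to the 'type' value they imply.
# This table is an assumption for illustration; extend it for your own code base.
_KEY_ALIASES = {
    "image": "image",
    "image_content": "image",
    "text": "text",
    "text_content": "text",
}


def normalize_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Return `entry` rewritten to use exactly the 'type' and 'content' keys."""
    if "type" in entry and "content" in entry:
        # Already in the expected shape; drop any extra keys.
        return {"type": entry["type"], "content": entry["content"]}
    matches = [key for key in entry if key in _KEY_ALIASES]
    if len(matches) != 1:
        # Zero or several candidate keys: refuse to guess.
        raise ValueError(f"cannot infer entry type from keys {sorted(map(str, entry))}")
    key = matches[0]
    return {"type": _KEY_ALIASES[key], "content": entry[key]}
```

A list of legacy dicts can then be converted with `[normalize_entry(e) for e in legacy_inputs]` before it is handed to the processor.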

Code: broken vs fixed

Broken - triggers the error

python

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# BROKEN: passing image and text as separate positional arguments
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# LLaVA 1.5 checkpoints load with LlavaForConditionalGeneration, not AutoModelForCausalLM
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")
prompt = "Describe this image in detail."

# This line FAILS with ValueError
inputs = processor(image, prompt, return_tensors="pt")  # ❌ WRONG: separate args

Fixed - works correctly

python

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# FIXED: using the correct format dict structure
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# LLaVA 1.5 checkpoints load with LlavaForConditionalGeneration, not AutoModelForCausalLM
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")
prompt = "Describe this image in detail."

# FIXED: wrap image and text in list of dicts with 'type' and 'content' keys
inputs = processor(
    [
        {"type": "image", "content": image},
        {"type": "text", "content": prompt},
    ],
    return_tensors="pt",
)  # ✅ CORRECT format

# Set an explicit generation budget; the default is short and truncates descriptions
output_ids = model.generate(**inputs, max_new_tokens=128)
result = processor.decode(output_ids[0], skip_special_tokens=True)
print(f"Result: {result}")

Changed from passing image and text as separate positional arguments to wrapping them in a list of format dicts with 'type' ('image'/'text') and 'content' keys, which is LLaVA's required input structure.

⚠

Workaround

If you're locked into a pipeline that expects (image, text) args, create a wrapper function that transforms those args into the required format dict list before calling processor: `def wrap_inputs(image, text): return [{'type': 'image', 'content': image}, {'type': 'text', 'content': text}]` then pass `processor(wrap_inputs(image, prompt), ...)`. This isolates the format conversion in one place.
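A slightly fuller version of that wrapper might look like the sketch below. Both `wrap_inputs` and `adapt` are hypothetical helpers, and `process` is assumed to be any callable that accepts the list-of-dicts structure described on this page as its first argument.

```python
from typing import Any, Callable, Dict, List


def wrap_inputs(image: Any, text: str) -> List[Dict[str, Any]]:
    """Convert an (image, text) pair into the list-of-dicts structure."""
    if not isinstance(text, str):
        # Catch swapped arguments early, e.g. wrap_inputs(prompt, image).
        raise TypeError(f"text must be str, got {type(text).__name__}")
    return [
        {"type": "image", "content": image},
        {"type": "text", "content": text},
    ]


def adapt(process: Callable[..., Any]) -> Callable[..., Any]:
    """Return a callable with an (image, text, **kwargs) signature.

    The returned function wraps its arguments and forwards them to `process`,
    so pipeline code that expects separate arguments does not need to change.
    """
    def call(image: Any, text: str, **kwargs: Any) -> Any:
        return process(wrap_inputs(image, text), **kwargs)

    return call
```

With this in place, a pipeline that expects `fn(image, text, ...)` can be handed `adapt(processor)` instead of the processor itself, which keeps the format conversion in one function.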

✓

Prevention

Consult the model's Hugging Face model card and use its example code as your template, because LLaVA's processor format differs from other vision-language models. Write a unit test that validates the input structure before it reaches the processor: `assert isinstance(inputs, list)` and `assert all('type' in d and 'content' in d for d in inputs)`. Use type hints to document the expected input shape.
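The assertions and type hints suggested above can live together in one validator. This is a sketch under the assumptions of this page: `LlavaEntry` and `validate_inputs` are hypothetical names, and the allowed 'type' values are the two listed here ('image' and 'text').

```python
from typing import Any, List, Literal, TypedDict


class LlavaEntry(TypedDict):
    """Documents the expected shape of one input entry (hypothetical type)."""
    type: Literal["image", "text"]
    content: Any


def validate_inputs(inputs: Any) -> List[LlavaEntry]:
    """Raise if `inputs` is not a list of {'type', 'content'} dicts; else return it."""
    if not isinstance(inputs, list):
        raise TypeError(f"inputs must be a list, got {type(inputs).__name__}")
    for i, entry in enumerate(inputs):
        if not isinstance(entry, dict) or set(entry) != {"type", "content"}:
            raise ValueError(f"entry {i} must have exactly the keys 'type' and 'content'")
        if entry["type"] not in ("image", "text"):
            raise ValueError(f"entry {i} has unknown type {entry['type']!r}")
        if entry["type"] == "text" and not isinstance(entry["content"], str):
            raise ValueError(f"entry {i} is text but its content is not a str")
    return inputs
```

Calling `validate_inputs(inputs)` in a unit test, or immediately before the processor call, surfaces a malformed structure with the index of the offending entry.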

Python 3.9+ · transformers >=4.40.0 · tested on 4.43.x

Verified 2026-04 · llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, llava-hf/llava-v1.6-mistral-7b-hf
