ValueError
ValueError: Images and text must be provided in the correct format for LLaVA processor
Stack trace
ValueError: Images and text must be provided in the correct format for LLaVA processor. Expected format: [{'type': 'image', 'content': <PIL.Image or tensor>}, {'type': 'text', 'content': 'text string'}]
File "/venv/lib/python3.11/site-packages/transformers/models/llava/image_processing_llava.py", line 127, in preprocess
raise ValueError("Images and text must be provided in the correct format for LLaVA processor. Expected format: [{'type': 'image', 'content': <image>}, {'type': 'text', 'content': 'text string'}]")
ValueError: Images and text must be provided in the correct format for LLaVA processor
Why it happens
LLaVA's processor.preprocess() expects its inputs as a list of dictionaries, each with a 'type' key ('image' or 'text') and a 'content' key. Developers often pass the image and text as separate positional arguments, or carry over the (image, text) call pattern from older vision models' APIs. The processor is strict about this format because it must interleave image and text tokens in the order the LLaVA multimodal architecture expects.
Detection
Inspect the input structure before passing it to the processor: `print(type(inputs), inputs if isinstance(inputs, list) else 'not a list')`. Wrap the processor call in try/except ValueError to catch format errors early, and log the raw input for debugging.
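A print statement shows the structure but does not say what is wrong with it. The sketch below is one way to turn the inspection into a reusable check. The helper name `describe_format_problems` and the `ALLOWED_TYPES` set are hypothetical, not part of transformers, and the check only assumes the 'type'/'content' format described on this page.

```python
from typing import Any

# Assumption: these are the only two 'type' values, as described on this page.
ALLOWED_TYPES = {"image", "text"}


def describe_format_problems(inputs: Any) -> list[str]:
    """Return readable problems; an empty list means the shape looks right."""
    if not isinstance(inputs, list):
        return [f"expected a list, got {type(inputs).__name__}"]

    problems = []
    for i, item in enumerate(inputs):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected dict, got {type(item).__name__}")
            continue
        missing = {"type", "content"} - set(item)
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
        elif item["type"] not in ALLOWED_TYPES:
            problems.append(f"item {i}: unknown type {item['type']!r}")
        elif item["type"] == "text" and not isinstance(item["content"], str):
            problems.append(f"item {i}: text content must be a str")
    return problems
```

Calling it just before the processor and logging the returned list gives a more specific message than the ValueError itself.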
Causes & fixes
Passing image and text as separate positional arguments instead of a list of format dicts
Restructure to: `inputs = processor([{'type': 'image', 'content': image}, {'type': 'text', 'content': prompt}], ...)` instead of `processor(image, text, ...)`
Using dict keys like 'image'/'text' or 'image_content'/'text_content' instead of 'type'/'content'
Rename the keys in every input dict to exactly 'type' (value: 'image' or 'text') and 'content' (value: PIL Image or string)
Passing a PIL Image or tensor directly without wrapping in the format dict structure
Always wrap: `[{'type': 'image', 'content': PIL.Image.open('file.jpg')}, {'type': 'text', 'content': prompt}]`
Mixing old vision model API patterns (like BLIP's processor) with LLaVA's strict format requirements
Review the Hugging Face LLaVA documentation for your exact model version, and use the model card's example code as a template. LLaVA's format differs from that of BLIP, ViLBERT, and other vision models
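Several of the causes above come down to data that holds the right content in the wrong shape. The following is a minimal sketch of a normalizer that converts those shapes into the 'type'/'content' format described on this page. The names `KEY_ALIASES`, `normalize_item`, and `normalize_inputs` are hypothetical, and the rule that any non-string bare value is treated as an image is an assumption made for illustration.

```python
from typing import Any

# Hypothetical alias table: wrong key name -> the 'type' value it stands for.
KEY_ALIASES = {
    "image": "image",
    "image_content": "image",
    "text": "text",
    "text_content": "text",
}


def normalize_item(item: Any) -> list[dict]:
    """Convert one loosely structured input into {'type', 'content'} dicts."""
    if isinstance(item, dict):
        if {"type", "content"} <= set(item):
            return [item]  # already in the expected shape
        unknown = set(item) - set(KEY_ALIASES)
        if unknown:
            raise ValueError(f"unrecognized keys: {sorted(unknown)}")
        return [{"type": KEY_ALIASES[k], "content": v} for k, v in item.items()]
    if isinstance(item, str):
        return [{"type": "text", "content": item}]
    # Assumption: any other bare value (PIL image, tensor) is an image.
    return [{"type": "image", "content": item}]


def normalize_inputs(raw: Any) -> list[dict]:
    """Accept a single value, a tuple, or a list and return a flat list of dicts."""
    items = raw if isinstance(raw, (list, tuple)) else [raw]
    normalized = []
    for item in items:
        normalized.extend(normalize_item(item))
    return normalized
```

A normalizer like this is most useful at the boundary where legacy data enters the code; new call sites are better off building the list of dicts directly.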
Code: broken vs fixed
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# BROKEN: passing image and text as separate positional arguments
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# LLaVA checkpoints load with LlavaForConditionalGeneration, not AutoModelForCausalLM
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")
prompt = "Describe this image in detail."

# This line FAILS with ValueError
inputs = processor(image, prompt, return_tensors="pt")  # ❌ WRONG: separate args
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# FIXED: using the correct format dict structure
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
# LLaVA checkpoints load with LlavaForConditionalGeneration, not AutoModelForCausalLM
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg")
prompt = "Describe this image in detail."

# FIXED: wrap image and text in a list of dicts with 'type' and 'content' keys
# (confirm the exact format against the model card for your transformers version)
inputs = processor(
    [
        {"type": "image", "content": image},
        {"type": "text", "content": prompt},
    ],
    return_tensors="pt",
)  # ✅ CORRECT format

# Set an explicit generation budget; the default length limit is very short
output_ids = model.generate(**inputs, max_new_tokens=128)
result = processor.decode(output_ids[0], skip_special_tokens=True)
print(f"Result: {result}")
Workaround
If you're locked into a pipeline that expects (image, text) args, create a wrapper function that converts those args into the required list of format dicts before calling the processor: `def wrap_inputs(image, text): return [{'type': 'image', 'content': image}, {'type': 'text', 'content': text}]`, then call `processor(wrap_inputs(image, prompt), ...)`. This isolates the format conversion in one place.
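When the pipeline itself makes the `processor(image, text, ...)` call and cannot be edited, the same idea can be applied one level up by wrapping the processor rather than the arguments. This is a sketch only: `adapt_processor` is a hypothetical helper, and `fake_processor` is a stand-in used so the example runs without downloading a model.

```python
from typing import Any, Callable


def wrap_inputs(image: Any, text: str) -> list[dict]:
    """Build the list of format dicts described on this page."""
    return [
        {"type": "image", "content": image},
        {"type": "text", "content": text},
    ]


def adapt_processor(processor: Callable[..., Any]) -> Callable[..., Any]:
    """Return a callable with an (image, text, **kwargs) signature."""
    def call(image: Any, text: str, **kwargs: Any) -> Any:
        # Convert once here, then forward all keyword arguments unchanged.
        return processor(wrap_inputs(image, text), **kwargs)
    return call


# Stand-in for the real processor: it just echoes what it received.
def fake_processor(items: list[dict], **kwargs: Any) -> dict:
    return {"items": items, "kwargs": kwargs}


legacy_style = adapt_processor(fake_processor)
result = legacy_style("IMAGE", "Describe this image.", return_tensors="pt")
```

The pipeline can then be handed `legacy_style` in place of the processor, so the conversion still lives in a single function.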
Prevention
Always consult the model's Hugging Face model card and use its example code as your template, because LLaVA's processor format differs from that of other vision-language models. Write a unit test that validates the input structure before it reaches the processor: `assert isinstance(inputs, list)` and `assert all('type' in d and 'content' in d for d in inputs)`. Use type hints to document the expected input shape.
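The type-hint and unit-test advice can be combined in one small module. The sketch below uses `TypedDict` to document the item shape and `unittest` to check it. `LlavaInputItem`, `build_inputs`, and `InputStructureTest` are hypothetical names, and the shape is the one described on this page rather than something taken from the transformers source.

```python
import unittest
from typing import Any, Literal, TypedDict


class LlavaInputItem(TypedDict):
    """One entry in the processor input list, as described on this page."""
    type: Literal["image", "text"]
    content: Any  # a PIL image or tensor for 'image', a str for 'text'


def build_inputs(image: Any, prompt: str) -> list[LlavaInputItem]:
    """Hypothetical project helper that every call site goes through."""
    return [
        {"type": "image", "content": image},
        {"type": "text", "content": prompt},
    ]


class InputStructureTest(unittest.TestCase):
    def test_inputs_have_expected_structure(self) -> None:
        inputs = build_inputs(object(), "Describe this image.")
        self.assertIsInstance(inputs, list)
        self.assertTrue(all("type" in d and "content" in d for d in inputs))
        self.assertEqual([d["type"] for d in inputs], ["image", "text"])
```

Saved as a test module, this runs with `python -m unittest`. A static type checker can then flag call sites that build the list with the wrong keys.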