Vision model loading
Why this matters
Vision-capable LLaMA models (like LLaVA) let you build multimodal applications that understand images and text together, which is critical for document analysis, visual Q&A, and accessibility features in production systems.
Explanation
What it is: Vision-enabled LLaMA models (typically LLaVA variants built on a LLaMA backbone) combine a vision encoder with a language model to process both images and text in a single forward pass.

How it works mechanically: When you load a vision model via transformers, `AutoProcessor` reads the checkpoint's configuration and returns a processor that wraps both an image processor and a tokenizer. You pass both pixel tensors (from `image_processor`) and token IDs (from `tokenizer`) to the model's `generate()` method. The vision encoder extracts image features, a projection layer maps them into the language model's token embedding space, and the LLaMA decoder attends to both image tokens and text tokens.

When to use it: Use this when you need to analyze specific image content (charts, photos, screenshots) and ask questions about it in natural language. Ollama simplifies this for edge deployment; transformers gives you fine-grained control over processing.
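The encode, project, and merge steps described under "How it works mechanically" can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the dimensions are tiny, the random matrices stand in for real encoder outputs and learned weights, and a single linear projection replaces whatever projector a given LLaVA variant actually uses. The helper names (`project`, `merge`) are hypothetical and not part of transformers.

```python
import random

random.seed(0)

VISION_DIM = 6   # toy size; real vision encoders use far larger feature vectors
TEXT_DIM = 4     # toy size; real decoders use far larger embeddings
IMAGE_TOKEN = "<image>"


def random_matrix(rows, cols):
    """Stand-in for encoder outputs or learned weights."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Linear projection: (n, VISION_DIM) x (VISION_DIM, TEXT_DIM) -> (n, TEXT_DIM)."""
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]


def merge(tokens, text_embeddings, image_embeddings):
    """Replace the placeholder token's slot with the projected patch embeddings."""
    merged = []
    for token, embedding in zip(tokens, text_embeddings):
        if token == IMAGE_TOKEN:
            merged.extend(image_embeddings)
        else:
            merged.append(embedding)
    return merged


tokens = ["USER:", IMAGE_TOKEN, "What", "is", "this?", "ASSISTANT:"]
text_embeddings = random_matrix(len(tokens), TEXT_DIM)
patch_features = random_matrix(3, VISION_DIM)        # pretend encoder output for 3 patches
projector_weight = random_matrix(VISION_DIM, TEXT_DIM)

image_embeddings = project(patch_features, projector_weight)
decoder_input = merge(tokens, text_embeddings, image_embeddings)

# 5 text tokens + 3 image patches = 8 positions, each of width TEXT_DIM
print(len(decoder_input), len(decoder_input[0]))  # 8 4
```

The point of the sketch is the shape bookkeeping: after projection, image patches look like ordinary token embeddings to the decoder, which is why it can attend over both in one sequence.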
Analogy
Think of vision loading like adding a second input stream to a telephone call: the language model was built to handle voice (tokens), but now you're also sending video (images). The model needs to decode video into something the language center understands before generating a response.
Code
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_id = 'llava-hf/llava-1.5-7b-hf'

# The processor bundles the image processor and the tokenizer.
processor = AutoProcessor.from_pretrained(model_id)

# LLaVA checkpoints are loaded with LlavaForConditionalGeneration,
# not AutoModelForCausalLM.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == 'cuda' else torch.float32,
    device_map='auto' if device == 'cuda' else None
)
if device == 'cpu':
    model = model.to(device)

url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')

# LLaVA 1.5 expects an <image> placeholder inside its USER/ASSISTANT template.
prompt = "USER: <image>\nWhat is in this image? Describe it briefly. ASSISTANT:"

inputs = processor(
    text=prompt,
    images=image,
    return_tensors='pt'
)
# Move every tensor to the device the model lives on.
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False
    )

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = output_ids[0][inputs['input_ids'].shape[1]:]
output_text = processor.decode(new_tokens, skip_special_tokens=True)
print(output_text)

Output:
A stop sign is shown in the image. The sign is red with white lettering that reads 'STOP'. It appears to be positioned on a road or street, typical of how stop signs are used to regulate traffic at intersections.
What just happened?
The code loaded a pretrained vision-language model (LLaVA 1.5 7B) with its corresponding processor. It fetched a test image (Australian stop sign) from a URL, encoded it along with a text prompt using the processor (which internally calls the image processor and tokenizer), moved tensors to GPU if available, ran inference with `generate()` to produce output tokens, and decoded them back to readable text. The model fused image and text representations internally during generation.
Common gotcha
The most common mistake: not using the same processor object for both encoding and decoding. Load one processor from the same checkpoint as the model; if you mix in a tokenizer or image processor from a different checkpoint, the token indices won't align with the model's vocabulary. Also, forgetting to move input tensors to the same device as the model causes errors such as 'expected device cuda:0 but got cpu'. Always move the inputs with `.to(device)` after processing.
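One way to catch the device mismatch before calling `generate()` is a small pre-flight check. The helper below is a hypothetical sketch, not a transformers or PyTorch API: it assumes each input exposes a `.device` attribute (as PyTorch tensors do) and compares its string form to the model's device. The stand-in objects let the sketch run without torch or a GPU.

```python
from types import SimpleNamespace


def find_device_mismatches(inputs, model_device):
    """Return the names of inputs whose .device differs from model_device."""
    target = str(model_device)
    return [
        name
        for name, value in inputs.items()
        if hasattr(value, "device") and str(value.device) != target
    ]


# Stand-ins for tensors: only the .device attribute matters for this check.
fake_inputs = {
    "input_ids": SimpleNamespace(device="cpu"),
    "pixel_values": SimpleNamespace(device="cuda:0"),
}

print(find_device_mismatches(fake_inputs, "cuda:0"))  # ['input_ids']
```

With real tensors you would pass the processor output and `model.device`; an empty list means every tensor is already where the model expects it.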
Error recovery
RuntimeError: expected ... but got cpu
OutOfMemoryError: CUDA out of memory
AttributeError: 'NoneType' object has no attribute 'to'
Image.open() throws error
Experienced dev note
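Vision models are heavier than their text-only backbones: the vision encoder and projector add parameters on top of the language model, and each image adds hundreds of tokens to the context. LLaVA 7B in float16 uses ~15GB VRAM. Base LLaMA 7B in float16 needs roughly 13 to 14 GB for its weights alone (about 2 bytes per parameter), so a figure of ~5GB applies only to a 4-bit quantized build, not to float16. In production, batch process images asynchronously and cache the vision encoder output if you're asking multiple questions about the same image. Also, `processor.image_processor.size` tells you the expected image resolution; if your images are tiny or massive, preprocessing matters for quality. Finally, LLaVA variants are trained on specific instruction templates (for llava-1.5-hf, `USER: <image>\n<question> ASSISTANT:`); deviating from the expected template sometimes reduces coherence.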
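VRAM figures like the ones above can be sanity-checked with simple arithmetic: parameter count times bytes per parameter. The sketch below is an estimate under stated assumptions. It covers weights only (no activations, KV cache, or framework overhead), uses a nominal 7 billion parameters rather than an exact count, and the function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes). Weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


nominal_7b = 7e9  # assumption: a round "7B" figure, not an exact parameter count

for dtype in ("float32", "float16", "int4"):
    print(dtype, weight_memory_gb(nominal_7b, dtype))
# float32 28.0
# float16 14.0
# int4 3.5
```

Actual usage at inference time is higher than these numbers, since image tokens lengthen the sequence and the KV cache grows with it.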
Check your understanding
If you wanted to process the same image with 5 different questions, why would re-encoding the image 5 times be wasteful, and how would you fix it using the model's architecture?
Answer hint
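A correct answer recognizes that the image encoder output depends only on the image, not on the question, so re-encoding the same image for each of the 5 prompts repeats identical work. (In the Hugging Face LLaVA implementation the encoder is exposed as `vision_tower`.) The fix involves running the image through the vision encoder and projector once, keeping the resulting image features tensor, and reusing it for every question so that only the text prompt changes between passes.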
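The reuse idea in the hint can be sketched as a cache keyed by image content. This is an illustration of the caching pattern only: `ImageFeatureCache` and `fake_encode` are hypothetical names, and the fake encoder stands in for the real vision encoder and projector. How you feed cached features back into a specific model depends on that model's API, which this sketch does not assume.

```python
import hashlib


class ImageFeatureCache:
    """Memoize an expensive image-encoding function by image content hash."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._features = {}

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._features:
            # Only the first request for a given image pays the encoding cost.
            self._features[key] = self._encode_fn(image_bytes)
        return self._features[key]


calls = {"count": 0}


def fake_encode(image_bytes):
    """Stand-in for the vision encoder + projector; counts how often it runs."""
    calls["count"] += 1
    return [b / 255 for b in image_bytes[:4]]


cache = ImageFeatureCache(fake_encode)
image = b"\x10\x20\x30\x40 pretend image bytes"
questions = [
    "What is in this image?",
    "What color is the sign?",
    "Is there any text?",
    "Is it day or night?",
    "Are there vehicles?",
]

for question in questions:
    features = cache.get(image)
    # A real pipeline would combine `features` with the tokenized question here.

print(calls["count"])  # 1
```

Five questions trigger a single encoder call; a different image would hash to a new key and be encoded once on first use.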