Code Intermediate medium · 6 min

Vision model loading

What you will learn
Load and initialize LLaMA vision models that process both text and images using transformers or Ollama.

Why this matters

Vision-capable LLaMA models (like LLaVA) let you build multimodal applications that understand images and text together: critical for document analysis, visual Q&A, and accessibility features in production systems.

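Skip if: You only need text processing. A vision model adds an image encoder and hundreds of image tokens per request, so it needs more VRAM and inference time than a text-only model; use base LLaMA 3 instead. Note also that the model cannot fetch images itself: if your images exist only as URLs, download and decode them first, then pass the decoded images to the processor.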

Explanation

What it is: Vision-enabled LLaMA models (typically LLaVA variants built on a LLaMA backbone) combine a vision encoder with a language model so that images and text are processed together in a single forward pass.

How it works mechanically: In transformers, the checkpoint's processor bundles an image processor and a tokenizer. Calling it on an image and a prompt returns pixel tensors (`pixel_values`) and token IDs (`input_ids`), which you pass to the model's `generate()` method. The vision encoder extracts image features, a projection layer maps them into the language model's embedding space, and the LLaMA decoder attends to image tokens and text tokens alike.

When to use it: Use this when you need to ask natural-language questions about specific image content (charts, photos, screenshots). Ollama simplifies this for edge deployment; transformers gives you fine-grained control over processing.
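To make the "image tokens" idea concrete, here is a toy sketch of the bookkeeping only, with no model involved. It assumes the LLaVA 1.5 setup of a 336-pixel input and 14-pixel patches; the token IDs and the placeholder ID are made up for illustration, and the function names are hypothetical rather than part of any library.

```python
# Toy illustration: how one image placeholder becomes a run of image-feature slots.
IMAGE_SIZE = 336   # input resolution of the vision encoder (assumed, LLaVA 1.5)
PATCH_SIZE = 14    # side length of one ViT patch in pixels (assumed, LLaVA 1.5)


def patches_per_image(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch embeddings the vision encoder emits for one image."""
    per_side = image_size // patch_size
    return per_side * per_side


def expand_image_placeholders(token_ids, image_token_id, num_patches):
    """Replace each placeholder ID with num_patches copies of itself.

    In the real model these slots are then filled with projected image
    features instead of ordinary token embeddings.
    """
    expanded = []
    for tok in token_ids:
        if tok == image_token_id:
            expanded.extend([image_token_id] * num_patches)
        else:
            expanded.append(tok)
    return expanded


IMAGE_TOKEN_ID = -1                              # hypothetical ID, for this demo only
text_ids = [1, 101, IMAGE_TOKEN_ID, 102, 103]    # made-up token IDs
sequence = expand_image_placeholders(text_ids, IMAGE_TOKEN_ID, patches_per_image())
print(patches_per_image(), len(sequence))
```

The point of the sketch is that a single image lengthens the decoder's input by several hundred positions, which is where most of the extra latency and memory comes from.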

Analogy

Think of vision loading like adding a second input stream to a telephone call: the language model was built to handle voice (tokens), but now you're also sending video (images). The model needs to decode video into something the language center understands before generating a response.

Code

python
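import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import requests

device = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = torch.float16 if device == 'cuda' else torch.float32
model_id = 'llava-hf/llava-1.5-7b-hf'

# The processor bundles the tokenizer and the image processor.
processor = AutoProcessor.from_pretrained(model_id)
# LLaVA checkpoints load through LlavaForConditionalGeneration,
# not AutoModelForCausalLM.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map='auto' if device == 'cuda' else None
)
if device == 'cpu':
    model = model.to(device)

url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
response = requests.get(url, stream=True, timeout=30)
response.raise_for_status()  # fail early on a broken URL
image = Image.open(response.raw).convert('RGB')

# LLaVA 1.5 expects its chat format; <image> marks where image features go.
prompt = "USER: <image>\nWhat is in this image? Describe it briefly. ASSISTANT:"
inputs = processor(
    text=prompt,
    images=image,
    return_tensors='pt'
)
# Move tensors to the model's device and cast floating-point tensors
# (pixel_values) to the model's dtype; integer token IDs are left as-is.
inputs = inputs.to(model.device, dtype)

with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False
    )

# generate() returns prompt + completion; keep only the new tokens.
new_tokens = output_ids[0][inputs['input_ids'].shape[1]:]
output_text = processor.decode(new_tokens, skip_special_tokens=True)
print(output_text)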
Output
A stop sign is shown in the image. The sign is red with white lettering that reads 'STOP'. It appears to be positioned on a road or street, typical of how stop signs are used to regulate traffic at intersections.

What just happened?

The code loaded a pretrained vision-language model (LLaVA 1.5 7B) with its corresponding processor. It fetched a test image (Australian stop sign) from a URL, encoded it along with a text prompt using the processor (which internally calls the image processor and tokenizer), moved tensors to GPU if available, ran inference with `generate()` to produce output tokens, and decoded them back to readable text. The model fused image and text representations internally during generation.
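For comparison, the Ollama route mentioned in the explanation can be sketched with the standard library alone. This assumes an Ollama server running locally on its default port and a vision model already pulled under the name `llava`; the helper names are hypothetical, and the request shape should be checked against the Ollama API documentation for your version.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed)


def build_ollama_payload(image_bytes: bytes, question: str, model: str = "llava") -> dict:
    """Build the JSON body for a single, non-streaming vision request."""
    return {
        "model": model,
        "prompt": question,
        # Images are sent as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_ollama(image_bytes: bytes, question: str, model: str = "llava") -> str:
    """Send the request to a running Ollama server and return the answer text."""
    body = json.dumps(build_ollama_payload(image_bytes, question, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

A call would look like `ask_ollama(open('stop_sign.jpg', 'rb').read(), 'What is in this image?')`, where the file path is a placeholder. Ollama handles the prompt template and tensor placement itself, which is the trade-off against the finer control transformers gives you.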

Common gotcha

The most common mistake: not using the same processor object for both encoding and decoding. If the tokenizer or image processor comes from a different checkpoint, the token IDs won't line up with the model's vocabulary. Also, forgetting to move input tensors to the same device as the model causes 'expected device cuda:0 but got cpu' errors: always call `.to(device)` on the inputs after processing.
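A related pitfall with the llava-hf checkpoints is a prompt that lacks the `<image>` placeholder or the chat format the model was trained on. The sketch below shows one way to guard against that; both helper names are hypothetical, and the template is the LLaVA 1.5 format, which other variants may not share.

```python
IMAGE_PLACEHOLDER = "<image>"


def build_llava15_prompt(question: str) -> str:
    """Wrap a question in the LLaVA 1.5 chat format with one image placeholder."""
    question = question.strip()
    if IMAGE_PLACEHOLDER in question:
        raise ValueError("Pass the bare question; the placeholder is added for you.")
    return f"USER: {IMAGE_PLACEHOLDER}\n{question} ASSISTANT:"


def check_prompt(prompt: str, num_images: int) -> None:
    """Fail early if placeholders and images don't match one-to-one."""
    found = prompt.count(IMAGE_PLACEHOLDER)
    if found != num_images:
        raise ValueError(f"prompt has {found} placeholder(s) for {num_images} image(s)")


prompt = build_llava15_prompt("What is in this image? Describe it briefly.")
check_prompt(prompt, num_images=1)
print(prompt)
```

Running the check before calling the processor turns a confusing shape or indexing error deep inside the model into a readable message at the call site.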

Error recovery

RuntimeError: expected ... but got cpu
Your inputs and model are on different devices. After processor() output, add: inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
OutOfMemoryError: CUDA out of memory
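Your GPU can't fit the full model. Use a smaller variant (llava-1.5-7b-hf instead of the 13b checkpoint), load the weights in 8-bit or 4-bit through a bitsandbytes quantization config (float8 is not a practical torch_dtype choice here), or keep device_map='auto' so that layers which don't fit on the GPU are offloaded to CPU RAM. On CPU, expect inference to be at least 3-5x slower, depending on hardware.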
AttributeError: 'NoneType' object has no attribute 'to'
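Something you called .to() on is None. from_pretrained() raises an exception on a failed download rather than returning None, so look for a helper function or try/except block that swallowed the load error and returned nothing. If the underlying failure is a network or authentication problem, verify your internet connection and run: huggingface-cli login with a valid token.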
Image.open() throws error
The image URL is broken or unreachable. Verify the URL returns a valid image. For local files, use Image.open('path/to/image.jpg') directly without requests.

Experienced dev note

Vision models are heavier than their text-only backbones, but mostly through longer sequences rather than extra weights: in float16 the LLaVA 1.5 7B weights take roughly 14 GB, about the same as base LLaMA 7B in float16 (a figure near 5 GB corresponds to a 4-bit quantized 7B model), and each image adds 576 tokens to the context. In production, batch process images asynchronously and cache the vision encoder output if you're asking multiple questions about the same image. Also, processor.image_processor.size tells you the expected image resolution; if your images are tiny or massive, preprocessing matters for quality. Finally, LLaVA variants are trained on specific instruction templates (for LLaVA 1.5, 'USER: <image>\n<question> ASSISTANT:'); deviating from the template often reduces coherence.
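A rough way to sanity-check memory figures like these is to multiply parameter count by bytes per parameter. The sketch below does only that; the parameter counts are rounded assumptions, and it ignores activations, the KV cache, and framework overhead, all of which come on top.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


LANGUAGE_MODEL_PARAMS = 7e9     # rounded parameter count (assumed)
VISION_ENCODER_PARAMS = 0.3e9   # rough size of a CLIP ViT-L encoder (assumed)

for bits in (16, 8, 4):
    text_only = weight_memory_gb(LANGUAGE_MODEL_PARAMS, bits)
    with_vision = weight_memory_gb(LANGUAGE_MODEL_PARAMS + VISION_ENCODER_PARAMS, bits)
    print(f"{bits:>2}-bit  text-only: {text_only:4.1f} GB   with vision encoder: {with_vision:4.1f} GB")
```

Under these assumptions the vision encoder itself is a small fraction of the weights, so quantizing the language model is where most of the savings are.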

Check your understanding

If you wanted to process the same image with 5 different questions, why would re-encoding the image 5 times be wasteful, and how would you fix it using the model's architecture?

Answer hint

A correct answer recognizes that the repeated work is in the vision encoder: the image features are identical for all five questions, so recomputing them is wasted effort, while the language decoder still has to run once per question. The fix is to run the image through the vision encoder once, keep the resulting image features tensor, and reuse it with each new text prompt.
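The caching half of that answer can be sketched independently of any model. Below, `fake_encoder` is a stand-in for the real vision encoder and `ImageFeatureCache` is a hypothetical helper; how cached features are fed back into a particular LLaVA implementation depends on the transformers version and is not shown.

```python
import hashlib
from typing import Any, Callable, Dict


class ImageFeatureCache:
    """Memoize an expensive image-encoding function by image content."""

    def __init__(self, encode_fn: Callable[[bytes], Any]):
        self._encode_fn = encode_fn            # hypothetical: wraps the vision encoder
        self._features: Dict[str, Any] = {}
        self.misses = 0                        # how many times we really encoded

    def get(self, image_bytes: bytes) -> Any:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._features:
            self.misses += 1
            self._features[key] = self._encode_fn(image_bytes)
        return self._features[key]


def fake_encoder(image_bytes: bytes) -> list:
    # Stand-in for the real vision encoder; returns a dummy "feature vector".
    return [len(image_bytes)]


cache = ImageFeatureCache(fake_encoder)
image = b"raw image bytes"
questions = ["What colour is the sign?", "What does it say?", "Is it daytime?"]
for question in questions:
    features = cache.get(image)   # encoded on the first pass, reused afterwards
print(cache.misses)
```

Keying on a content hash rather than a file name means the same image arriving from two sources is still encoded only once.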

VERSION

LLaVA models are supported in transformers 4.36+ (released Dec 2023); earlier releases do not include the LLaVA model classes. LLaMA 3.2-Vision is a newer alternative (llama3.2-vision available via Ollama 0.5+), and transformers has supported it through the Mllama model classes since version 4.45.

NEXT

Batch processing multiple images with vision models to optimize GPU utilization and reduce per-image latency in production systems.
