Code Intermediate medium · 7 min

Vision Transformers: ViT

What you will learn

Apply the transformer architecture to images by splitting them into sequences of patches, and classify visual content end-to-end.

Why this matters

Vision Transformers are now a common alternative to CNN backbones in production vision systems: given enough pretraining data they scale better than CNNs, and they integrate directly with the transformer-based multimodal models you'll build later.

Skip if: You need real-time inference on CPU, mobile, or embedded hardware without quantization; CNNs are usually still faster there. ViT is also a poor fit for training from scratch on a tiny dataset (<10k images), because it lacks the built-in locality bias of convolutions; in that case, fine-tune a pretrained ViT instead.

Explanation

Vision Transformers (ViT) apply the transformer architecture to images by dividing them into fixed-size patches, flattening each patch into a vector, and processing that sequence through a standard transformer encoder. Unlike CNNs, which use local convolutions, ViT treats the entire image as a sequence of tokens, just like text. Mechanically, an image is split into non-overlapping patches of 16×16 pixels (14×14 pixels in some variants), so a 224×224 image yields a 14×14 grid of 196 patches. Each patch is linearly embedded into a high-dimensional space, positional embeddings are added to preserve spatial information, and a learnable [CLS] token is prepended. The transformer encoder then attends across all patches, and the final [CLS] output is passed to a classification head. This design lets every layer relate any patch to any other patch, whereas CNNs build up their receptive fields incrementally. Use ViT when you have a moderate to large dataset (100k+ images), when you plan to fine-tune a pretrained checkpoint, or when you are building multimodal systems where transformer-to-transformer alignment is critical.
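The patch, embed, [CLS], and position steps described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the Hugging Face implementation: the class name PatchEmbedding and its parameter names are assumptions made for this example, and the defaults mirror ViT-base (224×224 input, 16×16 patches, 768-dim embeddings).

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Turn a batch of images into patch tokens plus a [CLS] token (illustrative sketch)."""

    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        if image_size % patch_size != 0:
            raise ValueError("image_size must be divisible by patch_size")
        self.num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 by default
        # A strided convolution is equivalent to cutting non-overlapping patches
        # and applying one shared linear layer to each flattened patch.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, images):
        x = self.proj(images)                            # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)                 # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # one [CLS] token per image
        x = torch.cat([cls, x], dim=1)                   # (B, num_patches + 1, dim)
        return x + self.pos_embed                        # add learned position embeddings


embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The resulting sequence of 197 tokens per image is what a standard transformer encoder would consume.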

Analogy

If a CNN is like a person scanning a photograph with a magnifying glass (local receptive field expanding gradually), ViT is like a person glancing at the entire image broken into grid squares and instantly considering how all squares relate to each other (global attention from the start).

Code

python

import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

model_name = 'google/vit-base-patch16-224'
image_processor = AutoImageProcessor.from_pretrained(model_name)
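# device_map='auto' places the model on a GPU when one is available;
# it requires the accelerate package to be installed.
model = AutoModelForImageClassification.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float32
)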
model.eval()

image_url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

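# Resize and normalize the image, then move the tensors to the model's device
# (device_map='auto' may have placed the model on a GPU).
inputs = image_processor(image, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**inputs)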

logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]
confidence = torch.softmax(logits, dim=-1)[0, predicted_class_idx].item()

print(f'Predicted: {predicted_label}')
print(f'Confidence: {confidence:.4f}')

Output

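Predicted: <one of the 1,000 ImageNet-1k labels>
Confidence: <softmax probability of that label>

The exact values depend on the image and the checkpoint. This checkpoint's label set is ImageNet-1k, which has no "beignet" class, so expect a related ImageNet label rather than the name of the dish.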

What just happened?

The code downloaded a pretrained ViT-base model (768-dim embeddings, 12 transformer layers, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k, which is where its 1,000 labels come from). It loaded the matching image processor, which resizes the image to 224×224 pixels and normalizes it, and converted the result to a PyTorch tensor. It then ran the model in eval mode under torch.no_grad(), took the logits from the classification head, applied softmax to turn them into probabilities, and printed the most likely class with its confidence score.
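A single argmax hides how close the runner-up classes are. The sketch below shows one way to list the top-k classes from a logits tensor; top_k_predictions is a hypothetical helper, and the logits and labels are stand-in values so the example runs without downloading a model. With the real model you would pass outputs.logits and model.config.id2label.

```python
import torch


def top_k_predictions(logits, id2label, k=5):
    """Return [(label, probability), ...] for the k most likely classes of one image."""
    probs = torch.softmax(logits, dim=-1)[0]  # first (only) image in the batch
    values, indices = torch.topk(probs, k)
    # .item() converts 0-dim tensors to plain Python numbers, which id2label expects
    return [(id2label[i.item()], v.item()) for v, i in zip(values, indices)]


# Stand-in logits and labels (assumptions for illustration only)
fake_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
fake_id2label = {0: "cat", 1: "dog", 2: "car", 3: "tree"}

for label, prob in top_k_predictions(fake_logits, fake_id2label, k=2):
    print(f"{label}: {prob:.4f}")
```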

Common gotcha

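Without device_map='auto' (or an explicit model.to('cuda')), from_pretrained() leaves the model on the CPU. Nothing fails; inference is just slow. With it, the model may land on a GPU, and the inputs must be moved to the same device (inputs.to(model.device)) or PyTorch raises a device-mismatch error. Also, the image processor must match the checkpoint. Input resolution is a property of the checkpoint, not of the model size: google/vit-base-patch16-224 expects 224×224, while google/vit-base-patch16-384 expects 384×384. Loading the processor from the same model name handles resizing for you. Feeding the model tensors of a different size raises an error unless you pass interpolate_pos_encoding=True, whereas wrong normalization statistics raise nothing and silently degrade predictions.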

Error recovery

OutOfMemoryError

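ViT-large has roughly 300M parameters, so its weights take only about 1.2 GB in float32; out-of-memory errors usually come from large batches, high input resolution, or training (gradients and optimizer state). Reduce the batch size first, switch to ViT-base (roughly 86M parameters), or add torch_dtype=torch.bfloat16 to halve the weight memory. Quantization via BitsAndBytesConfig is another option for inference.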
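A back-of-the-envelope estimate helps decide whether the weights themselves are the problem. The sketch below counts only weight memory; activations, gradients, and optimizer state come on top. The parameter counts are approximate published figures, and weight_memory_gb is a hypothetical helper written for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Approximate parameter counts (assumptions for illustration)
MODELS = {"ViT-base": 86_000_000, "ViT-large": 300_000_000}
DTYPES = {"float32": 4, "bfloat16": 2}

for model_name, num_params in MODELS.items():
    for dtype_name, size in DTYPES.items():
        gb = weight_memory_gb(num_params, size)
        print(f"{model_name} in {dtype_name}: about {gb:.2f} GB of weights")
```

If the estimate is far below your available VRAM, the batch size or image resolution is the more likely culprit.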

RuntimeError: Expected 3D input

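A shape or type error like this usually means pixel_values is not what the model expects: a PyTorch tensor of shape (batch, 3, height, width). Verify that you called image_processor() with return_tensors='pt' rather than return_tensors='np' (NumPy arrays won't work with the model), and that you converted the image with .convert('RGB') so grayscale or RGBA inputs end up with three channels.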

KeyError in id2label

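model.config.id2label maps integer class indices to label strings, so a KeyError usually means the key has the wrong type: a tensor or a string instead of a plain int. Call .item() on the argmax result before the lookup. To go the other way, from label string to index, use model.config.label2id. If a fine-tuned checkpoint only shows generic names such as LABEL_0, the mapping was never set; pass id2label and label2id to from_pretrained() when you create the model.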

Experienced dev note

Calling pipeline('image-classification') without a model argument falls back to a library-chosen default checkpoint and only logs a warning, and that default can change between releases. Always pin: pipeline('image-classification', model='google/vit-base-patch16-224'). Second insight: the pretrained patch embeddings reflect the statistics of the pretraining images, so for a domain that looks very different (medical scans, satellite imagery), training only the classification head often trails full fine-tuning; measure the gap on your own data before deciding what to freeze. Third: batch inference with different image sizes needs no padding. Pass a list of images, image_processor(images, return_tensors='pt'), and the processor resizes each one to the configured resolution so they stack into a single tensor; padding=True is a tokenizer argument that the ViT image processor does not need.
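The choice between training only the head and fine-tuning the whole network comes down to which parameters have requires_grad set. The sketch below uses a tiny stand-in model so it runs without downloading weights; TinyClassifier, freeze_all_but, and count_trainable are hypothetical names for this example. In the Hugging Face ViT classification model the head is named classifier and the encoder lives under vit, but inspect model.named_parameters() to confirm the names for your checkpoint.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for a pretrained ViT: a backbone plus a classification head."""

    def __init__(self, dim=32, num_labels=3):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))


def freeze_all_but(model, trainable_prefixes):
    """Keep only parameters whose name starts with one of the prefixes trainable."""
    prefixes = tuple(trainable_prefixes)
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)


def count_trainable(model):
    """Number of parameters the optimizer would actually update."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyClassifier()

freeze_all_but(model, ["classifier"])              # head only (linear probing)
head_only = count_trainable(model)

freeze_all_but(model, ["classifier", "backbone"])  # full fine-tuning
full = count_trainable(model)

print(f"head only: {head_only} trainable parameters")
print(f"full fine-tuning: {full} trainable parameters")
```

Passing only the parameters with requires_grad=True to the optimizer keeps the frozen weights untouched and reduces optimizer memory.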

Check your understanding

If you have a 448×448 image but your image processor is configured for 224×224 (the model was trained at that resolution), what happens when you pass the image through without resizing, and why does this matter for a production classifier?

Show answer hint

The processor automatically resizes the image to 224×224, halving each side (a quarter of the pixels), which can degrade fine details. For applications needing detailed recognition (medical imaging, fine-grained classification), consider a checkpoint trained at a higher resolution such as 384×384, fine-tuning at higher resolution with interpolated position embeddings, or a different architecture. The key point is that a preprocessing mismatch is not an error; it silently reduces performance.
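Higher resolution is not free, because the number of tokens grows with the square of the image side and self-attention cost grows with the square of the token count. The arithmetic can be sketched as follows; vit_sequence_length is a hypothetical helper, and the cost ratio is a rough estimate of the attention term only.

```python
def vit_sequence_length(image_size, patch_size=16):
    """Number of tokens the encoder sees: one per patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return (image_size // patch_size) ** 2 + 1


baseline = vit_sequence_length(224)
for size in (224, 384, 448):
    tokens = vit_sequence_length(size)
    # Self-attention compares every token with every other token
    relative_cost = (tokens / baseline) ** 2
    print(f"{size}x{size}: {tokens} tokens, about {relative_cost:.1f}x the attention cost of 224x224")
```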

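VERSION AutoImageProcessor is available in recent transformers 4.x releases as well as in 5.x. Older code that uses ViTFeatureExtractor or AutoFeatureExtractor should switch to it, since the feature-extractor classes were deprecated for vision models. device_map='auto' is optional and requires the accelerate package; without it, the model loads on the CPU and you move it yourself with model.to(device). Check the release notes for your installed version before relying on version-specific behavior.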

Once you can classify images with ViT, learn how to fine-tune a pretrained ViT on custom datasets using the Trainer API for domain-specific performance.
