Vision Transformers: ViT
Why this matters
Vision Transformers are increasingly used in place of CNN backbones in production vision systems: with enough pretraining data they scale better than CNNs, pretrained checkpoints fine-tune well, and they integrate cleanly with the multimodal models you'll build later.
Explanation
Vision Transformers (ViT) apply the transformer architecture to images by dividing them into fixed-size patches, flattening each patch into a vector, and processing that sequence through a standard transformer encoder. Unlike CNNs, which use local convolutions, ViT treats the entire image as a sequence of tokens, just like text. Mechanically, an image is split into patches of 16×16 or 14×14 pixels, each patch is linearly embedded into a high-dimensional space, positional embeddings are added to preserve spatial information, and a learnable [CLS] token is prepended. The transformer encoder then attends across all patches, and the final [CLS] output is passed to a classification head. This design lets the model relate any two regions of the image from the very first layer, whereas CNNs build up their receptive fields incrementally. The trade-off is that ViT has fewer built-in spatial assumptions than a CNN, so it needs more data to train well. Use ViT when you have moderate to large datasets (100k+ images), when you can start from a pretrained checkpoint and fine-tune, or when you are building multimodal systems where transformer-to-transformer alignment is critical.
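The patch-to-token steps described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the reference implementation: the sizes are assumptions chosen to match ViT-Base/16 at 224×224, the weights are random, and a single nn.TransformerEncoderLayer stands in for the full 12-layer encoder.

```python
import torch
import torch.nn as nn

# Assumed sizes, chosen to match ViT-Base/16 at 224x224.
image_size, patch_size, channels, hidden_dim, num_classes = 224, 16, 3, 768, 1000
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

images = torch.randn(2, channels, image_size, image_size)  # dummy batch of 2 images

# 1) Cut each image into non-overlapping 16x16 patches: (B, C, 14, 14, 16, 16).
patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# 2) Flatten each patch into one vector: (B, 196, C * 16 * 16).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(images.shape[0], num_patches, -1)

# 3) Linear patch embedding, learnable [CLS] token, learnable position embeddings.
patch_embed = nn.Linear(channels * patch_size * patch_size, hidden_dim)
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))

tokens = patch_embed(patches)                            # (B, 196, 768)
cls = cls_token.expand(tokens.shape[0], -1, -1)          # (B, 1, 768)
tokens = torch.cat([cls, tokens], dim=1) + pos_embed     # (B, 197, 768)

# 4) One encoder layer as a stand-in for the full stack, then classify from [CLS].
encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_dim, nhead=12, dim_feedforward=3072,
    activation='gelu', batch_first=True, norm_first=True,
).eval()
head = nn.Linear(hidden_dim, num_classes)

with torch.no_grad():
    encoded = encoder_layer(tokens)   # every token attends to every other token
    logits = head(encoded[:, 0])      # index 0 is the [CLS] position

print(tokens.shape, logits.shape)
```

Production implementations usually express steps 1 to 3 as a single strided convolution, which is equivalent to the explicit unfold-and-project shown here.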
Analogy
If a CNN is like a person scanning a photograph with a magnifying glass (local receptive field expanding gradually), ViT is like a person glancing at the entire image broken into grid squares and instantly considering how all squares relate to each other (global attention from the start).
Code
import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image
import requests

model_name = 'google/vit-base-patch16-224'
image_processor = AutoImageProcessor.from_pretrained(model_name)
# device_map='auto' needs the accelerate package; it places the model on a GPU if one is available
model = AutoModelForImageClassification.from_pretrained(
    model_name,
    device_map='auto',
    torch_dtype=torch.float32
)
model.eval()

image_url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
# Resize and normalize the image, then move the tensors to the model's device
inputs = image_processor(image, return_tensors='pt').to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_idx]
confidence = torch.softmax(logits, dim=-1)[0, predicted_class_idx].item()
print(f'Predicted: {predicted_label}')
print(f'Confidence: {confidence:.4f}')
Example output (the label is drawn from the checkpoint's own label set, so your result may differ):
Predicted: beignet
Confidence: 0.9987
What just happened?
The code downloaded a pretrained ViT-Base checkpoint (768-dim embeddings, 12 transformer layers, pretrained on ImageNet-21k and fine-tuned on ImageNet-1k at 224×224), loaded the matching image processor, which resizes the image to 224×224 and normalizes it, converted the image to a tensor, ran it through the model in eval mode under torch.no_grad(), took the logits from the classification head, applied softmax to turn them into probabilities, and printed the predicted class with its confidence score.
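A single top-1 label hides how close the runner-up classes were, so it is often worth looking at the top few probabilities. The sketch below shows one way to do that; the logits tensor and the id2label mapping are made-up stand-ins for outputs.logits and model.config.id2label, so that the snippet runs without downloading a model.

```python
import torch

# Stand-in values (assumptions): in the walkthrough these would be
# outputs.logits and model.config.id2label.
logits = torch.tensor([[2.0, 0.5, 3.0, -1.0]])
id2label = {0: 'cat', 1: 'dog', 2: 'bagel', 3: 'car'}

# Softmax turns raw logits into probabilities that sum to 1 per image.
probs = torch.softmax(logits, dim=-1)

# topk returns the k largest probabilities and their class indices, sorted descending.
top_probs, top_idx = probs.topk(k=3, dim=-1)

top3 = [
    (id2label[idx], prob)
    for idx, prob in zip(top_idx[0].tolist(), top_probs[0].tolist())
]
for label, prob in top3:
    print(f'{label}: {prob:.4f}')
```

A small gap between the first and second entries is a useful signal that the prediction should be treated with caution.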
Common gotcha
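Without device_map='auto' (which requires the accelerate package), from_pretrained() loads the model on the CPU and leaves it there; nothing fails, inference is just slow until you move the model yourself with model.to('cuda'). Whichever route you take, the inputs must be on the same device as the model (inputs.to(model.device)), or the forward pass raises a device-mismatch error. Also, the image processor must match the checkpoint's training resolution, which is set per checkpoint rather than per model size: google/vit-base-patch16-224 expects 224×224, while google/vit-base-patch16-384 expects 384×384. In the transformers implementation, feeding pixel values at a different resolution raises a size-mismatch error unless position-embedding interpolation is enabled (interpolate_pos_encoding=True). The silent failure mode is a processor whose resize or normalization settings differ from those used in training, which degrades accuracy without raising anything.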
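One way to see how a resolution mismatch behaves is to try it on a tiny, randomly initialised ViT, which avoids any download. The config values below are arbitrary assumptions for illustration, and the exact exception type may vary between transformers versions, so the sketch catches both ValueError and RuntimeError.

```python
import torch
from transformers import ViTConfig, ViTForImageClassification

# Tiny random ViT (hypothetical sizes) that expects 32x32 inputs.
config = ViTConfig(
    image_size=32, patch_size=16, hidden_size=32,
    num_hidden_layers=1, num_attention_heads=2,
    intermediate_size=64, num_labels=3,
)
model = ViTForImageClassification(config).eval()


def try_forward(side):
    """Run a forward pass on a random side x side image.

    Returns the logits shape on success, or the exception class name on failure.
    """
    pixel_values = torch.randn(1, 3, side, side)
    try:
        with torch.no_grad():
            return tuple(model(pixel_values=pixel_values).logits.shape)
    except (ValueError, RuntimeError) as err:
        return type(err).__name__


matching = try_forward(32)    # same resolution as config.image_size
mismatched = try_forward(64)  # bypasses the processor's resize step
print('matching:', matching)
print('mismatched:', mismatched)
```

In normal use the image processor performs the resize for you, so this error mostly appears when tensors are built by hand or with a processor taken from a different checkpoint.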
Error recovery
OutOfMemoryError
RuntimeError: Expected 3D input
KeyError in id2label
Experienced dev note
In transformers (5.5.x at the time of writing), pipeline('image-classification') called without a model argument falls back to a library-chosen default checkpoint, typically with only a warning, and that default is not guaranteed to stay the same across releases. Always pin: pipeline('image-classification', model='google/vit-base-patch16-224'). Second insight: the patch embedding is a single projection shared by every patch position and learned on the pretraining data, so for new domains it is often worth fine-tuning it along with the classification head; unfreezing just the head can cost accuracy, on the order of 5-10% in unfavorable cases, so measure on your own data. Third: the ViT image processor resizes every image to the same configured resolution, so a batch of differently sized images needs no padding. Pass the list directly, image_processor(images, return_tensors='pt'); padding=True is a tokenizer argument, not something this processor needs.
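The fine-tuning advice above amounts to choosing which parameters receive gradients. The sketch below shows one way to unfreeze the classification head together with the patch embedding while keeping the encoder frozen. It uses a tiny, randomly initialised ViT with arbitrary assumed sizes so that it runs without a download; with a real checkpoint you would load the model via from_pretrained() and apply the same loops.

```python
from transformers import ViTConfig, ViTForImageClassification

# Tiny random ViT (hypothetical sizes) standing in for a pretrained checkpoint.
config = ViTConfig(
    image_size=32, patch_size=16, hidden_size=32,
    num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64, num_labels=3,
)
model = ViTForImageClassification(config)

# Start from a fully frozen model.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the classification head (always needed for a new label set).
for param in model.classifier.parameters():
    param.requires_grad = True

# Also unfreeze the patch embedding projection so it can adapt to the new domain.
for param in model.vit.embeddings.patch_embeddings.parameters():
    param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(trainable)
print(f'trainable parameters: {n_trainable} of {n_total}')
```

Only parameters with requires_grad=True need to be handed to the optimizer, for example via filter(lambda p: p.requires_grad, model.parameters()).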
Check your understanding
If you have a 448×448 image and your image processor is configured for 224×224 (the resolution the model was trained at), what happens when you hand the image to the processor without resizing it yourself, and why does this matter for a production classifier?
Answer hint
The processor automatically resizes the image to 224×224, halving each side (4× fewer pixels), which can wipe out fine details. For applications that depend on them (medical imaging, fine-grained classification), you'd need a checkpoint trained or fine-tuned at a higher resolution such as 384×384, or a different architecture. The key point: the resize is not an error, so any loss of accuracy is silent.
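The cost of moving to a higher resolution can be estimated with simple arithmetic: the number of tokens grows with the square of the image side, and each self-attention map has one entry per pair of tokens. The helper below is a hypothetical illustration that assumes square images, 16×16 patches, and a single [CLS] token.

```python
def vit_sequence_stats(image_side, patch_side=16):
    """Return patch count, token count (patches + [CLS]) and attention-map size."""
    if image_side % patch_side != 0:
        raise ValueError('image side must be divisible by patch side')
    patches_per_side = image_side // patch_side
    patches = patches_per_side ** 2
    tokens = patches + 1  # one extra token for [CLS]
    return {
        'patches': patches,
        'tokens': tokens,
        'attention_entries': tokens ** 2,  # per head, per layer
    }


for side in (224, 384, 448):
    print(side, vit_sequence_stats(side))
# 224 {'patches': 196, 'tokens': 197, 'attention_entries': 38809}
# 384 {'patches': 576, 'tokens': 577, 'attention_entries': 332929}
# 448 {'patches': 784, 'tokens': 785, 'attention_entries': 616225}
```

Doubling the side from 224 to 448 gives four times as many patches and roughly sixteen times as many attention entries, which is why higher-resolution checkpoints are noticeably more expensive to run.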