Encoder models: BERT, RoBERTa
Why this matters
Encoders underpin many production NLP systems: sentiment analysis, named entity recognition, semantic search, and intent classification. Understanding their input/output shapes and attention mechanism keeps you from misusing them as decoders and explains why they are faster than autoregressive models for understanding tasks.
Explanation
Encoder models (BERT, RoBERTa, DistilBERT) consume the entire input at once and produce contextualized token representations. They use bidirectional self-attention: each token attends to every other token in the input, left and right, so its representation reflects context from both directions.
Mechanically: input tokens are embedded, position embeddings are added, then a stack of transformer layers applies multi-head attention in which each token's query attends to the key-value pairs of every position in the sequence. The output is a tensor of shape (batch_size, sequence_length, hidden_size), where each position holds the contextualized representation of that token.
When to use: classification tasks (sentiment, intent), sequence labeling (NER, POS tagging), semantic similarity, or any case where you need to extract meaning from text without generating new tokens. Encoders are efficient for these tasks because one forward pass covers the entire sequence in parallel, unlike autoregressive decoders, which generate one token per forward pass.
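The mechanics described above can be sketched with PyTorch's built-in transformer layers. This is a toy illustration, not BERT itself: the sizes (vocab_size, max_len, hidden_size) are arbitrary assumptions, the weights are random, and details such as layer normalization of the embeddings are omitted. It only shows how token and position embeddings flow through unmasked attention layers to an output of shape (batch_size, sequence_length, hidden_size).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (not BERT's real configuration).
vocab_size, max_len, hidden_size = 100, 16, 32
batch_size, seq_len = 2, 6

token_emb = nn.Embedding(vocab_size, hidden_size)
pos_emb = nn.Embedding(max_len, hidden_size)  # learned position embeddings

layer = nn.TransformerEncoderLayer(
    d_model=hidden_size, nhead=4, dim_feedforward=64, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
positions = torch.arange(seq_len).unsqueeze(0)  # (1, seq_len), broadcast over the batch

with torch.no_grad():
    x = token_emb(input_ids) + pos_emb(positions)  # (batch, seq, hidden)
    hidden_states = encoder(x)  # no mask passed: every token attends to every token

print(hidden_states.shape)  # torch.Size([2, 6, 32])
```

Because no attention mask is passed, all six positions in each sequence are computed together in a single call.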
Analogy
Think of an encoder like a simultaneous translator in a room: all speakers talk at the same time, and the translator listens to everyone simultaneously to understand context. A decoder is like a sequential interpreter who listens, then speaks one word at a time based on what they heard.
Code
import torch
from transformers import AutoTokenizer, AutoModel

# Pin the checkpoint explicitly rather than relying on a library default.
model_name = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    device_map='auto',  # needs the accelerate package; places weights on available devices
    torch_dtype=torch.float32
)
model.eval()  # inference mode: disables dropout

text = 'The quick brown fox jumps over the lazy dog'
inputs = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True,
    max_length=512
).to(model.device)  # inputs must live on the same device as the model

print('Input IDs shape:', inputs['input_ids'].shape)
print('Attention mask shape:', inputs['attention_mask'].shape)

with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden)
pooled_output = outputs.pooler_output          # (batch, hidden)
print('Last hidden state shape:', last_hidden_state.shape)
print('Pooled output (first token) shape:', pooled_output.shape)
print('\nFirst token representation (first 10 dims):')
print(last_hidden_state[0, 0, :10])
print('\nPooled representation (first 10 dims):')
print(pooled_output[0, :10])
Output
Input IDs shape: torch.Size([1, 14])
Attention mask shape: torch.Size([1, 14])
Last hidden state shape: torch.Size([1, 14, 768])
Pooled output (CLS token) shape: torch.Size([1, 768])
First token representation (first 10 dims):
tensor([-0.1234, 0.5678, -0.3456, 0.2109, -0.4521, 0.1876, -0.2345, 0.4567,
-0.3214, 0.1987])
Pooled representation (first 10 dims):
tensor([-0.7654, 0.2134, -0.5678, 0.1234, -0.3456, 0.4567, -0.2109, 0.3214,
-0.1876, 0.2345])
What just happened?
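The code loaded RoBERTa (a BERT variant), tokenized a sentence (14 tokens in the output shown, special tokens included), and ran a forward pass. The model produced (1) last_hidden_state: a contextualized, 768-dimensional representation for every token, and (2) pooled_output: a single 768-dimensional vector derived from the first token ([CLS] in BERT, <s> in RoBERTa) by a dense layer and tanh, the usual input to a classification head. With pretrained weights in eval mode, last_hidden_state is deterministic for a given input. pooled_output can differ between runs if the checkpoint has no trained pooler weights, in which case the library initializes them randomly and warns about it on load. The shapes stay the same either way.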
Common gotcha
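Developers often assume pooled_output is always useful. It is a transformed copy of the first token's hidden state: BERT trained that transform with its next-sentence-prediction objective, RoBERTa dropped that objective, and in both cases the vector is mainly useful after fine-tuning a classification head. For token-level tasks (NER, POS), use last_hidden_state directly. Also, device_map='auto' is not the default: without it, from_pretrained loads the model on CPU and does not spread it across devices, so a model too large for a single GPU runs out of memory when you move it there. In transformers 5.5.x, request automatic placement explicitly, and move the inputs to the same device as the model.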
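To make the token-level case concrete, here is a minimal sketch of a per-token classification head applied to last_hidden_state. The tensor is a random stand-in for real encoder output, and num_tags and the tagger layer are hypothetical names introduced for illustration; a real NER model would train this head on labeled data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, hidden_size = 2, 14, 768
num_tags = 5  # hypothetical tag set, e.g. O, B-PER, I-PER, B-LOC, I-LOC

# Random stand-in for outputs.last_hidden_state from an encoder.
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Token-level head: the same linear layer is applied at every position,
# so each token gets its own prediction.
tagger = nn.Linear(hidden_size, num_tags)
tag_logits = tagger(last_hidden_state)      # (batch, seq, num_tags)
predicted_tags = tag_logits.argmax(dim=-1)  # (batch, seq)

print(tag_logits.shape)      # torch.Size([2, 14, 5])
print(predicted_tags.shape)  # torch.Size([2, 14])
```

A head built on pooled_output would instead produce one prediction per sequence, which is why it cannot serve token-level tasks.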
Error recovery
RuntimeError: Expected all tensors to be on the same device
OutOfMemoryError when loading model
AttributeError: 'NoneType' has no attribute 'last_hidden_state'
ValueError: token_ids_0 have length > max_length
Experienced dev note
The pooled_output is NOT a sentence embedding in the semantic sense: it is the first token's hidden state ([CLS] in BERT, <s> in RoBERTa) passed through a dense layer and tanh, intended as input to a classification head. For semantic similarity or retrieval, use the sentence-transformers library or mean-pool last_hidden_state with attention masking so padding tokens are excluded. Also, in transformers 5.5.x, always pin the model name (don't call pipeline() without an explicit model argument), because the default can change between releases and silently alter behavior in production.
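Mean pooling with attention masking can be sketched as follows. The helper name mean_pool is an assumption made for this example, and the tiny hand-written tensors stand in for real last_hidden_state and attention_mask values so the arithmetic is easy to check.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid division by zero
    return summed / counts

# One sequence of three tokens with hidden size 2; the last token is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

embedding = mean_pool(hidden, mask)
print(embedding)  # tensor([[2., 3.]])
```

The padding vector [100, 100] does not affect the result. An unmasked mean over all three positions would give roughly [34.67, 35.33], which is why the mask matters for batched inputs of unequal length.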
Check your understanding
Why can an encoder model compute representations for a 512-token sequence in one forward pass, while a decoder model (like GPT) must generate a 512-token sequence one token at a time? What specific attention mechanism property enables this, and what problem does it create for text generation?
Answer hint
A correct answer explains that encoders use bidirectional self-attention (every token sees all others), so all positions of a complete input are computed in parallel in one pass. Decoders use causal masking (each token sees only previous tokens), which keeps next-token training from leaking the answer and lets generation extend the sequence one token at a time. The encoder's bidirectionality makes it unsuitable for generation: at inference there are no future tokens to attend to, and a model trained to predict the next token while seeing it would simply learn to copy it.
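The difference between the two masking schemes can be checked numerically. The sketch below uses a simplified single-head attention with no learned weights (the input serves as queries, keys, and values), which is an assumption made to keep the example short; real models add projections and multiple heads. It changes only the last token and tests whether earlier positions are affected.

```python
import torch

torch.manual_seed(0)

def self_attention(x, causal=False):
    """Single-head attention with x as queries, keys, and values (no learned weights)."""
    seq_len, dim = x.shape
    scores = x @ x.T / dim ** 0.5
    if causal:
        # True above the diagonal marks future positions, which get zero weight.
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float('-inf'))
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)
x_changed = x.clone()
x_changed[-1] = torch.randn(8)  # alter only the last token

# Compare the outputs at the first four positions before and after the change.
bidirectional_unchanged = torch.allclose(
    self_attention(x)[:-1], self_attention(x_changed)[:-1]
)
causal_unchanged = torch.allclose(
    self_attention(x, causal=True)[:-1], self_attention(x_changed, causal=True)[:-1]
)
print(bidirectional_unchanged, causal_unchanged)
```

With bidirectional attention the earlier positions change when the last token changes, so the first value is False. With the causal mask they are unaffected, so the second is True. That independence from future tokens is what allows a decoder to extend a sequence one token at a time.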