Code Intermediate medium · 7 min

Encoder models: BERT, RoBERTa

What you will learn

Encoder-only models like BERT and RoBERTa use bidirectional attention to understand text semantics for classification, tagging, and similarity tasks.

Why this matters

Encoders are the foundation for production NLP: sentiment analysis, named entity recognition, semantic search, and intent classification. Understanding their input/output shape and attention mechanism prevents misusing them as decoders and teaches you why they're faster than autoregressive models for understanding tasks.

Skip if: Do not use encoder models for text generation: they are pretrained with masked language modeling and have no causal language modeling head. When you need to generate tokens sequentially, use a decoder or encoder-decoder model instead. The same applies to any task that requires causal (left-to-right) attention.

Explanation

Encoder models (BERT, RoBERTa, DistilBERT) consume the entire input at once and produce contextualized token representations. They use bidirectional self-attention: each token attends to every other token in the input, left and right, so its representation reflects context from both directions.

Mechanically: input tokens are embedded, position embeddings (learned, in BERT and RoBERTa) are added, and a stack of transformer layers applies multi-head attention in which each token's query attends to the keys and values of all positions in the sequence. The output is a tensor of shape (batch_size, sequence_length, hidden_size) where each position holds the contextualized representation of that token.

When to use: classification tasks (sentiment, intent), sequence labeling (NER, POS tagging), semantic similarity, or any case where you need to extract meaning from text without generating new tokens. Encoders are efficient for these tasks because a single forward pass over the whole sequence yields every representation, whereas an autoregressive decoder needs one forward pass per generated token.
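The difference between bidirectional and causal attention can be sketched with a toy single-head attention function. This is a minimal illustration, not how BERT is implemented: the function self_attention, the identity projections, and the random inputs are all assumptions made for the example. It checks whether changing the last token changes the output for the first token.

```python
import torch

torch.manual_seed(0)

def self_attention(x, causal=False):
    """Toy single-head scaled dot-product self-attention (identity projections)."""
    seq_len, dim = x.shape
    scores = x @ x.T / dim ** 0.5  # (seq_len, seq_len) query-key scores
    if causal:
        # Hide positions to the right of each query (strict upper triangle).
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(5, 8)           # 5 tokens, 8 dimensions each
x_changed = x.clone()
x_changed[-1] = torch.randn(8)  # perturb only the last token

# Does the first token's output change when only the last token changes?
bidir_changed = not torch.allclose(
    self_attention(x)[0], self_attention(x_changed)[0]
)
causal_changed = not torch.allclose(
    self_attention(x, causal=True)[0], self_attention(x_changed, causal=True)[0]
)

print('bidirectional: first token affected by last token ->', bidir_changed)
print('causal:        first token affected by last token ->', causal_changed)
```

With bidirectional attention the first position depends on the last one; with the causal mask it does not, which is the property a decoder relies on for next-token prediction.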

Analogy

Think of an encoder like a simultaneous translator in a room: all speakers talk at the same time, and the translator listens to everyone simultaneously to understand context. A decoder is like a sequential interpreter who listens, then speaks one word at a time based on what they heard.

Code

python

import torch
from transformers import AutoTokenizer, AutoModel

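model_name = 'roberta-base'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_name)
# roberta-base fits on a single device, so explicit placement is enough here
model = AutoModel.from_pretrained(model_name).to(device)

# Inference mode: disables dropout so repeated runs give the same outputs
model.eval()

text = 'The quick brown fox jumps over the lazy dog'
inputs = tokenizer(
    text,
    return_tensors='pt',
    padding=True,
    truncation=True,
    max_length=512
).to(device)  # keep the inputs on the same device as the model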

print('Input IDs shape:', inputs['input_ids'].shape)
print('Attention mask shape:', inputs['attention_mask'].shape)

with torch.no_grad():
    outputs = model(**inputs)

last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

print('Last hidden state shape:', last_hidden_state.shape)
print('Pooled output (CLS token) shape:', pooled_output.shape)
print('\nFirst token representation (first 10 dims):')
print(last_hidden_state[0, 0, :10])
print('\nPooled representation (first 10 dims):')
print(pooled_output[0, :10])

Output

Input IDs shape: torch.Size([1, 11])
Attention mask shape: torch.Size([1, 11])
Last hidden state shape: torch.Size([1, 11, 768])
Pooled output (CLS token) shape: torch.Size([1, 768])

First token representation (first 10 dims):
tensor([-0.1234,  0.5678, -0.3456,  0.2109, -0.4521,  0.1876, -0.2345,  0.4567,
        -0.3214,  0.1987])

Pooled representation (first 10 dims):
tensor([-0.7654,  0.2134, -0.5678,  0.1234, -0.3456,  0.4567, -0.2109,  0.3214,
        -0.1876,  0.2345])

What just happened?

The code loaded RoBERTa (a BERT variant), tokenized the sentence into 11 tokens (9 subword tokens plus the <s> and </s> special tokens), and ran a forward pass. The model produced (1) last_hidden_state: contextualized representations for all 11 tokens, each 768-dimensional, and (2) pooler_output: a single 768-dimensional vector computed from the first token (<s> in RoBERTa, [CLS] in BERT) by a dense layer with tanh activation, typically used as the input to a classification head. The numeric values shown above are illustrative. In eval mode, last_hidden_state is deterministic for a given checkpoint and input; pooler_output can differ between loads when the checkpoint ships no pooler weights and the layer is randomly initialized, in which case transformers prints a warning about newly initialized weights. The shapes stay constant.

Common gotcha

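Developers assume pooler_output is always useful. It is not: the pooler is a small dense layer over the first token, pretrained in BERT with the next-sentence-prediction objective, and it is meaningful mainly after fine-tuning for sequence classification. For token-level tasks (NER, POS), use last_hidden_state directly. Also, device_map='auto' is not a general fix for GPU memory errors: it requires the accelerate package and exists to place or shard large models across devices. Without it, from_pretrained() loads the model on CPU and you move it yourself with model.to(device); transformers does not shard a model across devices unless you ask for it.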
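The shape difference between token-level and sequence-level heads can be sketched without downloading a model. In the example below a random tensor stands in for last_hidden_state, and num_tags, num_classes, and both linear heads are hypothetical, untrained placeholders chosen only to show the shapes.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch_size, seq_len, hidden_size = 2, 11, 768
num_tags = 5     # hypothetical tag set size for a token-level task such as NER
num_classes = 3  # hypothetical label count for a sequence-level task

# Stand-in for outputs.last_hidden_state from an encoder (assumption: random values)
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

token_head = nn.Linear(hidden_size, num_tags)        # applied to every position
sequence_head = nn.Linear(hidden_size, num_classes)  # applied to the first token only

token_logits = token_head(last_hidden_state)              # one prediction per token
sequence_logits = sequence_head(last_hidden_state[:, 0])  # one prediction per sequence

print('Token-level logits shape:', token_logits.shape)
print('Sequence-level logits shape:', sequence_logits.shape)
```

A token-level task keeps the sequence dimension, so it needs every row of last_hidden_state; a pooled vector has already discarded that dimension.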

Error recovery

RuntimeError: Expected all tensors to be on the same device

Your inputs are on CPU but model is on GPU (or vice versa). Add inputs = {k: v.to(model.device) for k, v in inputs.items()} before passing to model.

OutOfMemoryError when loading model

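Load the weights in half precision by passing torch_dtype=torch.bfloat16 (or torch.float16) to from_pretrained(); float32 is full precision and saves nothing. Add device_map='auto' to spread a large model across the available devices (requires the accelerate package). For very large models, add 8-bit quantization with quantization_config=BitsAndBytesConfig(load_in_8bit=True), which requires the bitsandbytes package.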
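To see roughly how much the dtype matters, a back-of-the-envelope estimate of weight memory can help. The sketch below is an approximation under stated assumptions: the helper weight_memory_gib is hypothetical, the parameter count of about 125 million for roberta-base is rounded, and activations, buffers, and framework overhead are ignored.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {'float32': 4, 'float16': 2, 'bfloat16': 2, 'int8': 1}

def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3

num_params = 125_000_000  # assumption: rounded parameter count for roberta-base

for dtype in BYTES_PER_PARAM:
    print(f'{dtype:>8}: {weight_memory_gib(num_params, dtype):.2f} GiB')
```

Halving the bytes per parameter halves the weight memory, which is why switching from float32 to a 16-bit dtype is usually the first thing to try.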

AttributeError: 'NoneType' has no attribute 'last_hidden_state'

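Neither .eval() nor torch.no_grad() changes what forward() returns: the model always returns an output object. This error means the variable you are reading from is None, typically because the forward call was wrapped in a helper function that has no return statement. A related error, 'tuple' object has no attribute 'last_hidden_state', appears when the model is called with return_dict=False; in that case index the result (outputs[0]) instead of using attribute access.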

ValueError: token_ids_0 have length > max_length

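Your sequence exceeds the model's maximum input length. BERT and RoBERTa have a fixed number of position embeddings (512 usable tokens), so raising max_length beyond that does not work. Pass truncation=True and max_length=512 to tokenizer(), or split long documents into overlapping chunks and encode each chunk separately.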
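When truncation would discard too much text, one option is to split the token IDs into overlapping windows and encode each window on its own. The following is a minimal sketch: chunk_token_ids is a hypothetical helper, and the window size of 510 (leaving room for two special tokens) and the overlap of 128 are illustrative choices, not recommendations from the library.

```python
def chunk_token_ids(token_ids, max_tokens=510, stride=128):
    """Split a list of token IDs into windows of at most max_tokens,
    where consecutive windows overlap by stride tokens."""
    if stride >= max_tokens:
        raise ValueError('stride must be smaller than max_tokens')
    step = max_tokens - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        # Stop once a window has reached the end of the sequence.
        if start + max_tokens >= len(token_ids):
            break
    return chunks

ids = list(range(1200))  # stand-in for a long document's token IDs
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [510, 510, 436]
```

Each chunk still needs its special tokens added before it goes to the model. Hugging Face tokenizers also accept return_overflowing_tokens=True together with stride, which produces overlapping windows without a manual loop.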

Experienced dev note

The pooler_output is NOT a sentence embedding in the semantic sense: it is the first token's hidden state ([CLS] in BERT, <s> in RoBERTa) passed through a dense layer with tanh activation, and it only becomes meaningful once fine-tuned for classification. For semantic similarity or retrieval, use the sentence-transformers library or mean-pool last_hidden_state with attention masking. Also, always pass an explicit model name to pipeline(): the task defaults can change between transformers releases, which can silently change behavior in production.
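Mean pooling with attention masking can be sketched in a few lines. The helper name mean_pool and the tiny hand-written tensors are assumptions for illustration; in practice the inputs would be outputs.last_hidden_state and inputs['attention_mask'] from the example above.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against division by zero
    return summed / counts

# One sequence of three tokens; the third is padding and must not affect the mean
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

print(mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

Without the mask, padded positions would pull the average toward whatever values the model produced for padding, so sentences of different lengths in the same batch would get distorted embeddings.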

Check your understanding

Why can an encoder model produce representations for a full 512-token sequence in one forward pass, while a decoder model (like GPT) must generate text token by token? What specific attention mechanism property is involved, and what problem does it create for text generation?

Show answer hint

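A correct answer explains that encoders use bidirectional self-attention (every token sees all others), so one parallel pass computes the final representation of every position. Decoders use causal masking (each token sees only previous tokens) so that training on next-token prediction cannot leak future tokens, and at inference each new token requires another forward pass. The encoder's bidirectionality makes it unsuitable for generation: its representations assume the whole sequence is already present, and it was never trained to predict the next token from left-only context.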

VERSION tokenizer.encode() returns a plain Python list of token IDs by default, with no attention mask. Call tokenizer(text, return_tensors='pt') to get the input_ids and attention_mask tensors that the model forward pass expects. device_map='auto' is optional: it relies on the accelerate package and is most useful for models too large for a single device; for small encoders, explicit placement with model.to(device) still works.

Now that you understand encoder outputs, learn how to fine-tune BERT for your own classification task using the Trainer API with custom datasets.
