CLS token pooling
Why this matters
Most classification tasks need a single vector per input sequence, and [CLS] pooling is a common pattern in Hugging Face code. Misunderstanding it leads to wrong tensor shapes, inconsistent embeddings, and failures in downstream models.
Explanation
CLS token pooling means taking the hidden state of the [CLS] (classification) token, which BERT-style tokenizers place at position 0, and using it as your sequence representation. This token is special: in BERT pretraining its final hidden state feeds the next-sentence prediction head, and during fine-tuning the classification head reads it, so the model learns to pack sequence-level information into it. That makes it a proxy for 'what this input means as a whole.'

Mechanically: after running your input through the model, you get a tensor of shape (batch_size, sequence_length, hidden_dim). You extract [:, 0, :], the first position of every batch element, yielding (batch_size, hidden_dim), which can be fed directly into a classification head. For classification fine-tuning this is often preferred over mean pooling because the head is trained on top of the [CLS] position. For semantic similarity with a model that has not been fine-tuned for it, the raw [CLS] vector is often a weak representation and mean pooling can work better, so compare both on your task. Use [CLS] pooling when you need one fixed vector per input: sentiment analysis, entailment, or as input to downstream classifiers.
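The slicing step can be checked without downloading a model. The sketch below uses a random tensor as a stand-in for last_hidden_state; the sizes (2, 9, 768) are assumptions chosen for illustration only.

```python
import torch

# Stand-in for outputs.last_hidden_state: random values, illustrative sizes.
batch_size, sequence_length, hidden_dim = 2, 9, 768
last_hidden_state = torch.randn(batch_size, sequence_length, hidden_dim)

# Keep every batch element, take position 0, keep every hidden feature.
cls_embedding = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # torch.Size([2, 9, 768])
print(cls_embedding.shape)      # torch.Size([2, 768])

# The pooled vector for item 0 is exactly that item's first-token hidden state.
assert torch.equal(cls_embedding[0], last_hidden_state[0, 0])
```

The sequence dimension disappears because indexing with an integer (0) drops that axis, while the two `:` slices keep the batch and feature axes.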
Analogy
Think of [CLS] as the 'table of contents' of a book. Rather than averaging summaries of every chapter (mean pooling), you grab the table of contents at the front, which was written specifically to represent the entire book's structure.
Code
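import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'bert-base-uncased'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Place the model explicitly so weights and inputs share one device.
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()  # make sure dropout is off for inference

sentences = [
    'The movie was absolutely fantastic.',
    'I did not enjoy the plot.'
]

# Pad to the longest sentence in the batch; truncate anything over 128 tokens.
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt',
    max_length=128
).to(device)

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**encoded)

# Shape: (batch_size, sequence_length, hidden_dim)
last_hidden_state = outputs.last_hidden_state
print(f'Full output shape: {last_hidden_state.shape}')

# Position 0 of every sequence is the [CLS] token.
cls_embedding = last_hidden_state[:, 0, :]
print(f'CLS embedding shape: {cls_embedding.shape}')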
print(f'CLS embedding for first sentence (first 10 dims):\n{cls_embedding[0, :10]}')

Output (exact embedding values may differ):
Full output shape: torch.Size([2, 9, 768])
CLS embedding shape: torch.Size([2, 768])
CLS embedding for first sentence (first 10 dims):
tensor([ 0.3421, -0.1254,  0.5891, -0.2147,  0.4532, -0.0876,  0.6234, -0.3512,
         0.2198, -0.1634])

What just happened?
We tokenized two sentences into padded token sequences, fed them through BERT, received a 3D tensor where each token has a 768-dim hidden representation, then sliced out position 0 (the [CLS] token) for every batch item, collapsing the sequence dimension and leaving us with a 2D tensor: 2 sentences, each with a 768-dimensional embedding.
Common gotcha
[CLS] sits at index 0 of the tokenized sequence. Don't confuse that position with the token ID, which is 101 for bert-base-uncased. BERT's tokenizer pads on the right by default, so the [CLS] position does not shift; with a tokenizer configured for left padding (padding_side='left'), position 0 of the shorter sequences would hold padding instead. If you read intermediate layers (outputs.hidden_states[layer_index], available when the model is called with output_hidden_states=True), [CLS] is still at position 0, but earlier layers generally carry less sequence-level information. Default to the final layer unless you have evidence that another layer works better for your task.
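To see the per-layer behavior without downloading weights, here is a minimal sketch that builds a tiny, randomly initialized BERT. The config sizes and the random token IDs are assumptions for illustration, so position 0 only plays the role of [CLS] here and the values carry no meaning.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized BERT (assumed sizes) so nothing is downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
model = BertModel(config)
model.eval()

# Fake token IDs: 2 sequences of 6 tokens. Real inputs come from a tokenizer.
input_ids = torch.randint(0, config.vocab_size, (2, 6))

with torch.no_grad():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# hidden_states holds the embedding output plus one tensor per layer.
hidden_states = outputs.hidden_states
print(len(hidden_states))  # 3 (embeddings + 2 layers)

# The first-token slot is position 0 in every layer; only its content changes.
first_token_per_layer = [h[:, 0, :] for h in hidden_states]
print([tuple(v.shape) for v in first_token_per_layer])

# The last entry matches last_hidden_state.
assert torch.allclose(hidden_states[-1], outputs.last_hidden_state)
```

With a pretrained checkpoint the same indexing applies; only the number of layers and the hidden size change.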
Error recovery
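IndexError: invalid index of a 0-d tensor
Likely cause: the tensor being indexed is already a scalar. Print its .shape before slicing.

RuntimeError: dimension out of range (expected to be in range of [-2, 1], but got 2)
Likely cause: an operation referred to dimension 2 of a 2D tensor, for example a tensor that has already been pooled to (batch_size, hidden_dim).

AttributeError: 'BaseModelOutput' object has no attribute 'cls_embedding'
Likely cause: the model output has no such field. Slice outputs.last_hidden_state[:, 0, :] instead.

Experienced dev note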
Many transformers 4.x-era examples mix pooling styles: some define a mean_pooling helper function, others slice manually. Slicing [:, 0, :] from last_hidden_state works the same way in later releases, so make the choice explicit in your code. ELECTRA uses a [CLS] token like BERT; RoBERTa puts <s> at position 0 in the same role, so the slice is identical even though the token's name differs. One subtle but production-critical point: if you're fine-tuning on a classification task, keep pooling consistent between training and inference. Hugging Face's sequence-classification heads for BERT-family models read the first-token representation, so compute your loss on that same embedding rather than switching to mean pooling.
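The two pooling styles mentioned above can be compared side by side. This is a minimal sketch on dummy tensors; cls_pool and mean_pool are hypothetical helper names written for this example, not functions from the transformers library.

```python
import torch

def cls_pool(last_hidden_state):
    """Return the hidden state at position 0 for each sequence."""
    return last_hidden_state[:, 0, :]

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Dummy batch: 2 sequences, 5 positions, hidden size 8 (illustrative sizes).
hidden = torch.randn(2, 5, 8)
# First sequence has 3 real tokens followed by 2 right-padding positions.
attention_mask = torch.tensor([[1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1]])

cls_vec = cls_pool(hidden)
mean_vec = mean_pool(hidden, attention_mask)

print(cls_vec.shape)   # torch.Size([2, 8])
print(mean_vec.shape)  # torch.Size([2, 8])
```

Both helpers return (batch_size, hidden_dim), so a downstream classifier accepts either. What matters is using the same one at training and inference time.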
Check your understanding
If you extracted last_hidden_state[:, 0, :] from a batch of 4 sequences encoded at max_length=512 with hidden_dim=768, and then passed this to a 2-class sentiment classifier, what would be the expected input shape to your classifier's first linear layer?
Answer hint
The shape is (4, 768). The [CLS] pooling reduces the sequence dimension (512 tokens) to a single representation per batch item, leaving batch_size × hidden_dim as your feature matrix for downstream layers.
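The answer can be verified with a quick shape check. In this sketch the zero tensor stands in for real model output, and the nn.Linear(768, 2) head is a hypothetical 2-class classifier used only for illustration.

```python
import torch
from torch import nn

# Stand-in for last_hidden_state: 4 sequences, 512 positions, hidden size 768.
last_hidden_state = torch.zeros(4, 512, 768)

# [CLS] pooling drops the sequence dimension.
cls_embedding = last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([4, 768])

# Hypothetical 2-class head: in_features must equal hidden_dim.
classifier = nn.Linear(768, 2)
logits = classifier(cls_embedding)
print(logits.shape)  # torch.Size([4, 2])
```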
Note: the examples here read .last_hidden_state with position 0 as [CLS]. device_map='auto' is optional, not required: it depends on the accelerate package, and whichever placement you choose, the input tensors must be on the same device as the model.