Code Intermediate medium · 5 min

CLS token pooling

What you will learn

Extract a fixed-size sentence representation by pooling the [CLS] token from the transformer's output instead of averaging all tokens.

Why this matters

Most classification tasks need a single vector per input sequence, and [CLS] pooling is a standard Hugging Face pattern for getting one. Misunderstanding it leads to wrong tensor shapes, inconsistent embeddings, and downstream model failures.

Skip if: Don't use [CLS] pooling when you need token-level representations (NER, POS tagging) or when working with models that don't have a [CLS] token (e.g., GPT-style left-to-right models where you'd use the last token instead).

Explanation

CLS token pooling means taking the hidden state of the [CLS] (classification) token, which BERT-style tokenizers place at position 0, and using it as your sequence representation. This token is special: in BERT's pretraining, the next-sentence-prediction objective is computed from it, and fine-tuning for classification trains it further to act as a proxy for 'what this input means as a whole.' Mechanically, after running your input through the model you get a tensor of shape (batch_size, sequence_length, hidden_dim). You extract [:, 0, :], the first position of every batch element, yielding (batch_size, hidden_dim), which can be fed directly into a classification head. It is often preferred over mean pooling for fine-tuned classification because training pushes this one position to aggregate meaning. Without fine-tuning, neither pooling method is guaranteed to win, so compare both if you use the embeddings as-is. Use it for tasks where you need one fixed vector per input: sentiment analysis, entailment, semantic similarity (ideally after fine-tuning), or as input to downstream classifiers.
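The slicing step described above can be sketched without downloading a model. This is a minimal illustration, assuming a random tensor as a stand-in for outputs.last_hidden_state and a made-up attention mask; the mean-pooling lines are included only for shape comparison.

```python
import torch

torch.manual_seed(0)

# Stand-in for outputs.last_hidden_state: (batch_size, sequence_length, hidden_dim).
batch_size, seq_len, hidden_dim = 2, 9, 768
last_hidden_state = torch.randn(batch_size, seq_len, hidden_dim)

# Made-up attention mask: the first sequence has one padding position at the end.
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
attention_mask[0, -1] = 0

# [CLS] pooling: keep position 0 of every sequence.
cls_embedding = last_hidden_state[:, 0, :]

# Mean pooling, for comparison: average only over non-padding positions.
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
mean_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape)   # torch.Size([2, 768])
print(mean_embedding.shape)  # torch.Size([2, 768])
```

Both approaches produce one vector per sequence; they differ only in which positions contribute to it.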

Analogy

Think of [CLS] as the 'table of contents' of a book. Rather than averaging summaries of every chapter (mean pooling), you grab the actual table of contents at the front: it was written specifically to represent the entire book's structure.

Code

python

import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loads on CPU. device_map='auto' is optional and needs the accelerate package.
model = AutoModel.from_pretrained(model_name)
model.eval()  # inference mode: dropout off

sentences = [
    'The movie was absolutely fantastic.',
    'I did not enjoy the plot.'
]

# Pad to the longest sentence in the batch and return PyTorch tensors.
encoded = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors='pt',
    max_length=128
)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**encoded)

# Shape: (batch_size, sequence_length, hidden_dim)
last_hidden_state = outputs.last_hidden_state
print(f'Full output shape: {last_hidden_state.shape}')

# [CLS] pooling: keep position 0 of every sequence -> (batch_size, hidden_dim)
cls_embedding = last_hidden_state[:, 0, :]
print(f'CLS embedding shape: {cls_embedding.shape}')
print(f'CLS embedding for first sentence (first 10 dims):\n{cls_embedding[0, :10]}')

Output

Full output shape: torch.Size([2, 9, 768])
CLS embedding shape: torch.Size([2, 768])
CLS embedding for first sentence (first 10 dims):
tensor([ 0.3421, -0.1254,  0.5891, -0.2147,  0.4532, -0.0876,  0.6234, -0.3512,
         0.2198, -0.1634])

What just happened?

We tokenized two sentences into padded token sequences, fed them through BERT, received a 3D tensor where each token has a 768-dim hidden representation, then sliced out position 0 (the [CLS] token) for every batch item, collapsing the sequence dimension and leaving us with a 2D tensor: 2 sentences, each with a 768-dimensional embedding.

Common gotcha

The [CLS] token sits at index 0 of the tokenized sequence. When you inspect token IDs, don't confuse that position with the token's ID (101 in BERT's vocabulary). BERT tokenizers pad on the right by default, so the [CLS] position does not shift; if you set padding_side='left', it does, and position 0 holds padding for shorter sequences. If you read an intermediate layer via outputs.hidden_states[layer_index] (available when you pass output_hidden_states=True), [CLS] is still at index 0, but earlier layers have mixed in less context from the rest of the sequence. Start with the final layer unless you have measured that another layer works better for your task.
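The point about right padding can be checked with plain lists of token IDs. This is a small sketch: the pad helper is hypothetical (it is not a transformers function), 7592 is an arbitrary placeholder word ID, and 101, 102, and 0 are the [CLS], [SEP], and [PAD] IDs in BERT's vocabulary.

```python
# BERT's special-token IDs; the word ID below is an arbitrary placeholder.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad(ids, length, side="right"):
    """Hypothetical helper: pad a list of token IDs to `length` on one side."""
    padding = [PAD_ID] * (length - len(ids))
    return ids + padding if side == "right" else padding + ids


short = [CLS_ID, 7592, SEP_ID]  # three tokens, needs two pads to reach length 5

right_padded = pad(short, 5, side="right")
left_padded = pad(short, 5, side="left")

print(right_padded.index(CLS_ID))  # 0
print(left_padded.index(CLS_ID))   # 2
```

With right padding the [CLS] position stays at 0 for every sequence in the batch; with left padding, slicing position 0 would pick up a padding vector for the shorter sequences.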

Error recovery

IndexError: invalid index of a 0-d tensor

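You indexed a 0-dimensional (scalar) tensor, usually because an earlier step already reduced the tensor to a single value (for example an unintended <code>.mean()</code> or <code>.sum()</code> with no <code>dim</code>). Print <code>last_hidden_state.shape</code> before slicing: it should be 3D <code>(batch, seq, hidden)</code>. The tokenizer with <code>return_tensors='pt'</code> adds a batch dimension even for a single string, so a missing dimension usually comes from your own indexing or squeezing.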

RuntimeError: dimension out of range (expected to be in range of [-2, 1], but got 2)

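You addressed dimension 2 of a 2D tensor, for example with <code>.mean(dim=2)</code> or <code>.size(2)</code> on a <code>(batch, hidden)</code> tensor when you expected 3D. Ensure the model output has not been squeezed or already pooled, and that <code>last_hidden_state.shape</code> is 3D before slicing.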

AttributeError: 'BaseModelOutput' object has no attribute 'cls_embedding'

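Transformers 5.5.x does not auto-extract [CLS] under that name. There is no built-in <code>.cls_embedding</code> attribute, so slice manually with <code>last_hidden_state[:, 0, :]</code>. BERT models also return <code>pooler_output</code>, but that is the [CLS] hidden state passed through an extra dense layer and tanh, not the raw hidden state.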
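One way to surface the shape problems above early is a small guard around the slice. This is a sketch only: cls_pool is a hypothetical helper name, not part of transformers, and the zero tensors stand in for real model output.

```python
import torch


def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return position 0 of each sequence, or fail readably."""
    if last_hidden_state.ndim != 3:
        raise ValueError(
            f"expected (batch, seq, hidden), got shape {tuple(last_hidden_state.shape)}"
        )
    return last_hidden_state[:, 0, :]


good = torch.zeros(2, 9, 768)
print(cls_pool(good).shape)  # torch.Size([2, 768])

try:
    cls_pool(torch.zeros(9, 768))  # batch dimension was squeezed away
except ValueError as err:
    print(err)  # expected (batch, seq, hidden), got shape (9, 768)
```

The guard turns a cryptic indexing error into a message that names the expected layout and the shape actually received.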

Experienced dev note

Older examples were inconsistent: some used a hand-written mean_pooling() helper (transformers has no such method), others sliced manually. The explicit idiom, which this lesson uses with 5.5.x, is to slice [:, 0, :] from last_hidden_state. RoBERTa and ELECTRA work the same way, with one naming difference: RoBERTa's tokenizer uses its start-of-sequence token as cls_token instead of a literal [CLS], but it sits at position 0 and plays the same role. One subtle but production-critical point: if you're fine-tuning a BERT-style model for classification, computing the loss from the [CLS] embedding is the conventional choice, since pretraining already used that position for a sequence-level prediction. Whichever pooling you train with, use the same one at inference.

Check your understanding

If you extracted last_hidden_state[:, 0, :] from a batch of 4 sequences encoded at max_length=512 with hidden_dim=768, and then passed this to a 2-class sentiment classifier, what would be the expected input shape to your classifier's first linear layer?

Show answer hint

The shape is (4, 768). The [CLS] pooling reduces the sequence dimension (512 tokens) to a single representation per batch item, leaving batch_size × hidden_dim as your feature matrix for downstream layers.
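The answer can be verified with a short sketch, assuming a random tensor in place of real model output and a single nn.Linear layer as a stand-in for the sentiment classifier.

```python
import torch
from torch import nn

batch_size, seq_len, hidden_dim, num_classes = 4, 512, 768, 2

# Stand-in for outputs.last_hidden_state from a batch of 4 sequences of 512 tokens.
last_hidden_state = torch.randn(batch_size, seq_len, hidden_dim)

# [CLS] pooling removes the sequence dimension.
cls_embedding = last_hidden_state[:, 0, :]

# Stand-in classifier: its first (and only) layer expects hidden_dim input features.
classifier = nn.Linear(hidden_dim, num_classes)
logits = classifier(cls_embedding)

print(cls_embedding.shape)  # torch.Size([4, 768])
print(logits.shape)         # torch.Size([4, 2])
```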

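VERSION Before transformers 4.0, model outputs were plain tuples by default, so examples read the final hidden states as outputs[0] rather than outputs.last_hidden_state, which made [CLS] extraction look inconsistent across tutorials. From 4.x onward, including the 5.5.x release this lesson targets, BERT-family models return an output object with .last_hidden_state, and position 0 holds [CLS] under the default right padding. device_map='auto' is optional, not required: it depends on the accelerate package and only controls where the weights are placed, so check that your inputs end up on the same device as the model.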

Once you can extract [CLS] embeddings, the natural next step is learning how to attach a classification head on top and fine-tune the entire model end-to-end, which requires understanding gradient flow and the Trainer API.
