Code Intermediate medium · 6 min

Token classification: per-token labels

What you will learn

Assign a label to every token in a sequence using pre-trained transformer models for tasks like named entity recognition and part-of-speech tagging.

Why this matters

Token classification is essential for NLP tasks that need fine-grained predictions at the token level: named entity recognition (finding person/place/org names), part-of-speech tagging, and biomedical entity extraction are all high-value production use cases.

Skip if: Don't use token classification when you need a single label for the entire sequence (use sequence classification instead) or when you're doing open-ended generation (use causal language modeling). Also avoid it if your entities are discontinuous or overlapping: a standard token classification head assigns exactly one label to each token.

Explanation

Token classification assigns a categorical label to each token in an input sequence. Unlike sequence classification (which produces one label per document), token classification produces one label per token, which makes it well suited to identifying entities, parts of speech, or other token-level phenomena. Mechanically, a transformer encoder processes the entire sequence, so each token's final representation reflects its context; a classification head (typically a single linear layer shared across positions) then maps each of those representations to a score per class. The model learns token-level patterns through a cross-entropy loss computed over the token positions. When to use it: Named Entity Recognition (NER), part-of-speech (POS) tagging, chunking, slot filling in dialogue systems, and biomedical named entity extraction are the primary use cases.
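The mechanics described above can be sketched without downloading a model. This is a minimal illustration, not the internals of any particular checkpoint: the random hidden_states tensor stands in for an encoder's output, the sizes are arbitrary, and the -100 label value is the conventional "ignore this position" marker (it is the default ignore_index of PyTorch's CrossEntropyLoss).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Arbitrary sizes chosen for illustration only.
batch_size, seq_len, hidden_size, num_labels = 2, 6, 16, 5

# Stand-in for the encoder output: one hidden vector per token.
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# The classification head: one linear layer applied at every position.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(hidden_states)  # shape: [batch, seq_len, num_labels]

# Gold label ids per token; -100 marks positions excluded from the loss.
labels = torch.randint(0, num_labels, (batch_size, seq_len))
labels[:, 0] = -100   # e.g. the [CLS] position
labels[:, -1] = -100  # e.g. the [SEP] position

# Cross-entropy averaged over all non-ignored token positions.
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))

# One predicted label id per token.
predictions = logits.argmax(dim=-1)
print(logits.shape, predictions.shape, loss.item())
```

The key point is the shape: the head turns [batch, seq_len, hidden_size] into [batch, seq_len, num_labels], so every token position gets its own score vector.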

Analogy

Like a spell-checker that marks each word as 'correct', 'misspelled', or 'proper noun': you're labeling each position, not the whole document.

Code

python

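import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Checkpoint id as given in this lesson. If the Hub cannot resolve it, substitute
# any token-classification checkpoint (for example dslim/bert-base-NER, which is cased).
model_name = "distilbert-base-uncased-finetuned-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map='auto' requires the accelerate package.
model = AutoModelForTokenClassification.from_pretrained(model_name, device_map='auto')

text = "My name is Sarah and I live in Paris."

# Tokenize, then move the input tensors to the same device as the model.
encoded = tokenizer(text, return_tensors='pt').to(model.device)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**encoded)
    logits = outputs.logits  # shape: [batch_size, sequence_length, num_labels]

# Recover the token strings (including [CLS] and [SEP]) for display.
token_ids = encoded['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(token_ids)
# Highest-scoring label id for each token of the single example in the batch.
predictions = torch.argmax(logits[0], dim=-1)

# Maps integer label ids to readable names such as 'B-PER'.
id2label = model.config.id2label

for token, pred_id in zip(tokens, predictions):
    label = id2label[pred_id.item()]  # .item() turns the tensor into a Python int
    print(f"{token:15} → {label}")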

print("\n--- Using pipeline (higher-level API) ---")
ner_pipeline = pipeline('ner', model=model_name, aggregation_strategy='simple')
results = ner_pipeline(text)
for entity in results:
    # With an aggregation strategy set, the label is stored under 'entity_group'
    # and the B-/I- prefix is dropped.
    print(f"{entity['word']:15} ({entity['entity_group']:10}): confidence {entity['score']:.3f}")

Output

[CLS]           → O
my              → O
name            → O
is              → O
sarah           → B-PER
and             → O
i               → O
live            → O
in              → O
paris           → B-LOC
.               → O
[SEP]           → O

--- Using pipeline (higher-level API) ---
sarah           (PER       ): confidence 0.998
paris           (LOC       ): confidence 0.999

What just happened?

The code loaded a pre-trained NER model, tokenized the input text, passed tokens through the model to get logits, extracted the predicted label for each token by taking the argmax across the class dimension, and then mapped those label IDs back to readable names using the model's config. The pipeline wrapper did the same thing but handled tokenization and entity merging automatically.

Common gotcha

Tokenizer behavior is the #1 stumbling block. Subword tokenization can break a word into pieces: with an uncased WordPiece tokenizer, 'Sarah' might stay whole as ['sarah'], while a rarer string like 'Sarahsmall' might become ['sarah', '##small']. The model emits one label per token, which means one label per subword piece, not per word. Separately, most NER datasets use the BIO scheme (or a variant such as BIOES) to mark multi-token entities: 'B-' marks the start of an entity, 'I-' marks a continuation, and 'O' means outside any entity. If you don't merge subword pieces back into words, you'll get confusing results, such as '##small' carrying its own label separate from 'sarah'.
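One common way to handle this by hand is to glue WordPiece continuations back onto the preceding word and keep only the first piece's label. The sketch below assumes WordPiece-style '##' prefixes and BERT-style special tokens; merge_wordpieces is a hypothetical helper written for this lesson, and the token and label lists are made up rather than real model output.

```python
def merge_wordpieces(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Collapse WordPiece tokens into words, keeping each word's first-piece label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # special tokens are not part of the text
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Made-up example data for illustration.
tokens = ["[CLS]", "sarah", "##small", "lives", "in", "paris", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]

print(merge_wordpieces(tokens, labels))
# [('sarahsmall', 'B-PER'), ('lives', 'O'), ('in', 'O'), ('paris', 'B-LOC')]
```

Keeping the first piece's label is only one possible policy; averaging or taking the maximum score across pieces are alternatives with different trade-offs.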

Error recovery

RuntimeError: Expected all tensors to be on the same device

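The model is on a different device than your input tensors. Move the inputs to the model's device: encoded = tokenizer(text, return_tensors='pt').to(model.device). Note that device_map='auto' only decides where the model's weights are placed; it does not reliably move your inputs for you, so the explicit .to(model.device) is the safe option.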

KeyError when accessing id2label

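The id2label mapping is keyed by Python integers, so indexing it with a tensor or a string raises KeyError; convert first, as in id2label[pred_id.item()]. If the lookup works but returns generic names like LABEL_0, the checkpoint was probably not fine-tuned for token classification, so ensure you're using a token classification model, not a causal language model.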

Logits shape mismatch

Logits have shape [batch_size, sequence_length, num_labels]. If you're processing single examples, remember to index [0] for the batch dimension first: torch.argmax(logits[0], dim=-1).
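When you process more than one example at a time, the same argmax applies to the whole batch, but padded positions should be dropped before reading the labels. A minimal sketch with hand-written logits and an attention mask (both made up for illustration; in real use they come from the model and the tokenizer):

```python
import torch

# Made-up logits: batch of 2 sequences, 4 positions, 3 labels.
logits = torch.tensor([
    [[2.0, 0.1, 0.1], [0.1, 3.0, 0.2], [0.3, 0.2, 1.5], [0.0, 0.0, 0.0]],
    [[1.0, 0.2, 0.1], [0.1, 0.2, 2.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])
# 1 marks a real token, 0 marks padding.
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])

# Argmax over the label dimension keeps the batch and sequence dimensions.
predictions = logits.argmax(dim=-1)  # shape: [batch_size, sequence_length]

# Keep only the predictions at non-padded positions, one list per sequence.
per_sequence = [
    row[mask.bool()].tolist() for row, mask in zip(predictions, attention_mask)
]
print(per_sequence)  # [[0, 1, 2], [0, 2]]
```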

Experienced dev note

In production, always use the aggregation_strategy parameter in the pipeline to merge subword tokens automatically. Using aggregation_strategy='simple' or 'first' prevents downstream confusion about why your entity spans don't align with the original text. Also read model.config.id2label into a local variable once rather than inside loops: the lookup is cheap, but repeating it clutters the code. Finally, token classification models are often smaller and faster than you'd expect; distilbert-based NER models run comfortably on CPU for real-time inference, so don't over-engineer.
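To make the idea of merging tokens into entity spans concrete, here is a simplified sketch of decoding word-level BIO tags into spans. It is not the pipeline's actual implementation; bio_to_spans is a hypothetical helper, and the words and tags are made-up inputs.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_type, current_words
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for word, tag in zip(words, tags):
        if tag == "O":
            flush()
            continue
        prefix, _, entity_type = tag.partition("-")
        # A 'B-' tag, or a change of entity type, starts a new span.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_words.append(word)
    flush()
    return spans


# Made-up example data for illustration.
words = ["sarah", "connor", "lives", "in", "new", "york", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]

print(bio_to_spans(words, tags))
# [('PER', 'sarah connor'), ('LOC', 'new york')]
```

Note that a stray 'I-' tag with no preceding 'B-' is treated here as the start of a new span; stricter decoders may choose to discard it instead.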

Check your understanding

If a word in your input tokenizes into 3 subword pieces and you use a pre-trained NER model, how many labels will you get back from the model's output logits, and why would labeling all three pieces identically cause problems in practice?

Answer hint

A complete answer covers three points: (1) the logits contain exactly 3 rows for that word, one score vector per subword token; (2) labeling all three pieces 'B-PER' breaks the BIO scheme, in which only the first token of an entity is 'B-' and the rest are 'I-', so a decoder would read three separate entities; and (3) the pipeline's aggregation_strategy handles this by merging the pieces back into a single entity span.

Version note: transformers 5.x changed the default tokenizer behavior for batch processing: always use return_tensors='pt' explicitly. The pipeline API is stable, but device_map='auto' is required in 5.x for multi-GPU or quantized models (it's optional in 4.x). Also note: AutoModelForTokenClassification in 5.x no longer accepts load_in_8bit directly; use BitsAndBytesConfig instead.

Now that you can label individual tokens, learn how to fine-tune a token classification model on your own labeled dataset using the Trainer API to adapt a pre-trained model to your specific domain or language.
