Token classification: per-token labels
Why this matters
Token classification is essential for NLP tasks where you need fine-grained predictions at the word level: named entity recognition (finding person/place/org names), part-of-speech tagging, and biomedical entity extraction are all high-value production use cases.
Explanation
Token classification assigns a categorical label to each token in an input sequence. Unlike sequence classification (which produces one label per document), token classification produces one label per token, which makes it well suited to identifying entities, parts of speech, or other token-level phenomena. Mechanically, a transformer encoder processes the entire sequence, so each token's final representation already reflects its context; a classification head (typically a single linear layer shared across positions) then maps each of those representations to a class. The model learns token-level patterns through cross-entropy loss computed across token positions, usually skipping positions that carry no label, such as special tokens and padding.
When to use it: Named Entity Recognition (NER), part-of-speech (POS) tagging, chunking, slot filling in dialogue systems, and biomedical named entity extraction are the primary use cases.
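The mechanics described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the internals of any particular model: the encoder output is replaced by random tensors, the sizes are arbitrary, and the use of -100 to mark positions excluded from the loss is an assumption that follows the common PyTorch and Hugging Face convention.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not taken from a real checkpoint).
batch_size, seq_len, hidden_size, num_labels = 2, 6, 16, 5

# Stand-in for the encoder's final hidden states: one vector per token.
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# The classification head: one linear layer applied at every position.
head = nn.Linear(hidden_size, num_labels)
logits = head(hidden_states)  # shape: (batch_size, seq_len, num_labels)

# Random gold labels per token; -100 marks positions to skip in the loss.
labels = torch.randint(0, num_labels, (batch_size, seq_len))
labels[:, 0] = -100   # e.g. the [CLS] position
labels[:, -1] = -100  # e.g. the [SEP] position

# Flatten all tokens and average cross-entropy over the kept positions.
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.view(-1, num_labels), labels.view(-1))

# One predicted class per token.
predictions = logits.argmax(dim=-1)  # shape: (batch_size, seq_len)
print(tuple(logits.shape), tuple(predictions.shape), loss.item())
```

The point to notice is the shape: the logits keep the sequence dimension, so there is one row of class scores per token rather than one per document.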
Analogy
Like a spell-checker that marks each word as 'correct', 'misspelled', or 'proper noun': you're labeling each position, not the whole document.
Code
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
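# Checkpoint name as given in this lesson; confirm it exists on the Hugging Face Hub,
# or substitute a token-classification checkpoint you have access to.
model_name = "distilbert-base-uncased-finetuned-ner"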
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, device_map='auto')
text = "My name is Sarah and I live in Paris."
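# Move the inputs to the model's device: device_map='auto' may place the model on a GPU
encoded = tokenizer(text, return_tensors='pt', padding=True).to(model.device)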
with torch.no_grad():
    outputs = model(**encoded)
logits = outputs.logits
token_ids = encoded['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(token_ids)
predictions = torch.argmax(logits[0], dim=-1)
id2label = model.config.id2label
for token, pred_id in zip(tokens, predictions):
    label = id2label[pred_id.item()]
    print(f"{token:15} → {label}")
print("\n--- Using pipeline (higher-level API) ---")
ner_pipeline = pipeline('ner', model=model_name, aggregation_strategy='simple')
results = ner_pipeline(text)
for entity in results:
    # With an aggregation strategy set, each result exposes 'entity_group' rather than 'entity'
    print(f"{entity['word']:15} ({entity['entity_group']:10}): confidence {entity['score']:.3f}")
Output
[CLS]           → O
my              → O
name            → O
is              → O
sarah           → B-PER
and             → O
i               → O
live            → O
in              → O
paris           → B-LOC
.               → O
[SEP]           → O

--- Using pipeline (higher-level API) ---
sarah           (PER       ): confidence 0.998
paris           (LOC       ): confidence 0.999
What just happened?
The code loaded a pre-trained NER model, tokenized the input text, passed tokens through the model to get logits, extracted the predicted label for each token by taking the argmax across the class dimension, and then mapped those label IDs back to readable names using the model's config. The pipeline wrapper did the same thing but handled tokenization and entity merging automatically.
Common gotcha
Tokenizer behavior is the #1 stumbling block. Subword tokenization can break a word into pieces: with an uncased WordPiece tokenizer, 'Sarah' might tokenize as ['sarah'], while an out-of-vocabulary word such as 'Sarahsmall' might become ['sarah', '##small']. The model emits one label per token, which means one label per subword piece, not one per word. Separately, most NER datasets use the BIO or BIOES tagging scheme to mark entity boundaries: 'B-' marks the start of an entity, 'I-' marks its continuation, and 'O' means outside any entity (BIOES adds 'E-' for the end of an entity and 'S-' for a single-token entity). If you don't merge subword pieces back into words, you'll get confusing results, such as '##small' receiving its own label separate from 'sarah'.
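A minimal sketch of subword merging, assuming WordPiece-style '##' continuation markers and BERT-style special tokens. The helper name merge_wordpieces and the token/label lists are hypothetical, made up for illustration rather than taken from real model output. Keeping the first piece's label is similar in spirit to the pipeline's 'first' aggregation strategy, though the pipeline's own implementation differs.

```python
def merge_wordpieces(tokens, labels):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens carry no entity information
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the continuation onto the previous word
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


# Hypothetical tokens and per-token predictions (not real model output).
tokens = ["[CLS]", "sarah", "##small", "lives", "in", "paris", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]

merged = merge_wordpieces(tokens, labels)
print(merged)
```

After merging there is one label per word, so the results line up with what a reader of the original text would expect.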
Error recovery
RuntimeError: Expected all tensors to be on the same device
KeyError when accessing id2label
Logits shape mismatch
Experienced dev note
In production, always use the aggregation_strategy parameter in the pipeline to merge subword tokens automatically. Using aggregation_strategy='simple' or 'first' prevents downstream confusion about why your entity spans don't align with the original text. Also, pre-compute id2label once instead of inside loops: it's just a dictionary lookup but it clutters readability. Finally, token classification models are often smaller/faster than you'd expect; distilbert-based NER models run on CPU comfortably for real-time inference, so don't over-engineer.
Check your understanding
If a word in your input tokenizes into 3 subword pieces and you use a pre-trained NER model, how many labels will you get back from the model's output logits, and why would labeling all three pieces identically cause problems in practice?
Answer hint
The answer requires understanding that (1) you get exactly three logit vectors, and therefore three predicted labels, one per subword token; (2) labeling all three as 'B-PER' violates the BIO scheme, which expects only the first token of a multi-token entity to be 'B-' and the rest to be 'I-', so the three pieces would read as three separate entities; and (3) aggregation_strategy handles this by merging the pieces back into a single entity span.
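The BIO point in the hint can be made concrete with a small sketch. Both helpers below (expand_label and count_entities) are hypothetical names introduced for illustration, and counting 'B-' tags is a simplification that assumes strict BIO tagging.

```python
def expand_label(word_label, num_pieces):
    """Spread one word-level BIO label over a word's subword pieces."""
    if word_label == "O" or num_pieces <= 1:
        return [word_label] * num_pieces
    entity_type = word_label.split("-", 1)[1]
    # Only the first piece keeps the original tag; the rest continue the entity.
    return [word_label] + [f"I-{entity_type}"] * (num_pieces - 1)


def count_entities(labels):
    """Count entities under strict BIO: each 'B-' tag opens a new entity."""
    return sum(1 for label in labels if label.startswith("B-"))


identical = ["B-PER"] * 3               # all three pieces labeled the same
bio_valid = expand_label("B-PER", 3)    # first piece 'B-', the rest 'I-'

print(identical, count_entities(identical))
print(bio_valid, count_entities(bio_valid))
```

Under this counting, the identically labeled pieces read as three separate people, while the BIO-consistent version reads as one.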