Aggregation strategies: group entities
Why this matters
Transformer models tokenize text into subwords, but real-world NER pipelines need entity boundaries at the word level: without aggregation, you'll get fragmented or duplicate tags that break downstream processing.
Explanation
When you tokenize text with a fast tokenizer, one word often splits into multiple subword tokens (e.g., 'running' → ['run', '##ning']). A token-level classifier outputs predictions for each token independently, creating a mismatch: you get labels for 'run' and '##ning' separately, but your application needs a single 'running' label.
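Aggregation strategies solve this by grouping subword predictions back into meaningful units. The main strategies are simple (take the label of the first subword token), average (average the per-class scores or logits across subwords, then take the top label), and max (take the label of the most confident subword). Note that the transformers `pipeline()` calls the first-subword rule `first` and reserves `simple` for grouping adjacent tokens without any word-level merging. Mechanically, you track the original word boundaries using the `word_ids()` method of the fast tokenizer's output, which maps each token back to its source word, then apply your chosen rule to merge predictions.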
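The merge step can be sketched without loading a model. This is a minimal illustration, not the library's implementation: the `aggregate` helper, the three-label set, the strategy keys, and every logit value are hypothetical, and the first-subword rule is keyed as `first` here. The average branch averages softmax probabilities, which is one of two reasonable choices.

```python
import math
from collections import defaultdict

LABELS = ["O", "B-PER", "I-PER"]  # hypothetical label set for illustration

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(word_ids, token_logits, strategy):
    """Merge token-level predictions into one (label, score) pair per word."""
    # Group token positions by source word, skipping special tokens (None).
    groups = defaultdict(list)
    for idx, wid in enumerate(word_ids):
        if wid is not None:
            groups[wid].append(idx)

    result = {}
    for wid, idxs in groups.items():
        probs = [softmax(token_logits[i]) for i in idxs]
        if strategy == "first":
            chosen = probs[0]                      # first subword decides
        elif strategy == "average":
            chosen = [sum(col) / len(probs) for col in zip(*probs)]
        elif strategy == "max":
            chosen = max(probs, key=max)           # most confident subword decides
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        best = max(range(len(chosen)), key=chosen.__getitem__)
        result[wid] = (LABELS[best], chosen[best])
    return result

# Made-up example: word 0 is one token, word 1 splits into four subwords.
word_ids = [None, 0, 1, 1, 1, 1, None]
token_logits = [
    [5.0, 0.0, 0.0],  # [CLS]
    [3.0, 0.0, 0.0],  # word 0
    [1.0, 0.5, 0.0],  # word 1, subword 1: leans O
    [0.0, 2.0, 0.0],  # word 1, subword 2: leans B-PER
    [0.0, 2.0, 0.0],  # word 1, subword 3: leans B-PER
    [0.0, 0.5, 4.0],  # word 1, subword 4: strongly I-PER
    [5.0, 0.0, 0.0],  # [SEP]
]

by_first = aggregate(word_ids, token_logits, "first")
by_average = aggregate(word_ids, token_logits, "average")
by_max = aggregate(word_ids, token_logits, "max")
for name, res in [("first", by_first), ("average", by_average), ("max", by_max)]:
    print(name, {wid: label for wid, (label, _) in res.items()})
```

With these made-up numbers, the three rules disagree on word 1, which is the situation where the choice of strategy matters.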
Use this when you're building a custom NER pipeline that operates at the token level but needs word-level outputs, or when you're fine-tuning a model and need to evaluate it with standard NER metrics that expect non-overlapping entity spans.
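Span-based metrics need one more step after word-level labels: merging consecutive B-/I- tags into entity spans. The sketch below assumes BIO-style labels and a hypothetical helper name, `bio_to_spans`; it treats a stray I- tag with no open entity as the start of a new span, which is a lenient design choice rather than a standard.

```python
def bio_to_spans(words, labels):
    """Merge word-level BIO labels into (type, start, end, text) spans.

    `end` is exclusive. An I- tag with no open span of the same type is
    treated as the start of a new span (a lenient assumption).
    """
    spans = []
    current = None  # [entity_type, start_index] of the span being built
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[0] != etype)
        )
        # Close the open span when we hit O or the start of another entity.
        if current is not None and (prefix == "O" or starts_new):
            spans.append((current[0], current[1], i, " ".join(words[current[1]:i])))
            current = None
        if starts_new:
            current = [etype, i]
    if current is not None:
        spans.append((current[0], current[1], len(words), " ".join(words[current[1]:])))
    return spans

words = ["John", "Smith", "works", "at", "Google", "in", "California", "."]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]
spans = bio_to_spans(words, labels)
print(spans)
```

Because each word belongs to at most one span, the result is non-overlapping by construction.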
Analogy
Think of it like transcription cleanup: a speech-to-text model emits a confidence score for each fragment it hears, so 'running' might come back as two pieces with two scores. Aggregation decides which to trust: the first piece's score (simple), the average of both (average), or whichever piece is most confident (max).
Code
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'dbmdz/bert-base-cased-finetuned-conll03-english'
# AutoTokenizer returns a fast tokenizer by default, which word_ids() requires.
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in default precision on the default device so that the inputs and the
# model stay on the same device without extra dependencies.
model = AutoModelForTokenClassification.from_pretrained(model_name)
id2label = model.config.id2label

text = 'John Smith works at Google in California.'
encodings = tokenizer(
    text,
    truncation=True,
    return_tensors='pt',
    return_offsets_mapping=True
)
# Maps each token to the index of its source word (None for special tokens).
word_ids = encodings.word_ids()
print(f'Tokens: {tokenizer.convert_ids_to_tokens(encodings["input_ids"][0])}')
print(f'Word IDs: {word_ids}')

with torch.no_grad():
    # The model's forward() does not accept offset_mapping, so filter it out.
    outputs = model(**{k: v for k, v in encodings.items() if k != 'offset_mapping'})
logits = outputs.logits[0]                  # shape: (num_tokens, num_labels)
predictions = torch.argmax(logits, dim=-1)  # best class index per token
scores = torch.softmax(logits, dim=-1)      # per-token class probabilities
print(f'\nPredictions (token-level): {predictions.tolist()}')
print(f'Prediction labels: {[id2label[p.item()] for p in predictions]}')
print(f'Scores shape: {scores.shape}')

# Aggregation: Simple strategy (first subword)
aggregated_simple = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue  # skip special tokens such as [CLS] and [SEP]
    if word_id not in aggregated_simple:
        # Only the first token seen for each word is recorded.
        aggregated_simple[word_id] = {
            'label': id2label[predictions[token_idx].item()],
            'score': scores[token_idx].max().item()
        }

# Aggregation: Average strategy
word_logits = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    word_logits.setdefault(word_id, []).append(logits[token_idx])
aggregated_avg = {}
for word_id, token_logits in word_logits.items():
    # Average the logits over the word's subwords, then pick the best class.
    avg_logits = torch.stack(token_logits).mean(dim=0)
    avg_label = torch.argmax(avg_logits).item()
    # Report the confidence of the averaged prediction, not of the first token.
    avg_score = torch.softmax(avg_logits, dim=-1).max().item()
    aggregated_avg[word_id] = {
        'label': id2label[avg_label],
        'score': avg_score
    }

# Aggregation: Max strategy
aggregated_max = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    max_score, max_score_idx = scores[token_idx].max(dim=0)
    # Keep the prediction of the most confident subword for each word.
    if word_id not in aggregated_max or max_score.item() > aggregated_max[word_id]['score']:
        aggregated_max[word_id] = {
            'label': id2label[max_score_idx.item()],
            'score': max_score.item()
        }

print(f'\nSimple aggregation (first subword):')
for word_id, result in aggregated_simple.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')
print(f'\nAverage aggregation:')
for word_id, result in aggregated_avg.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')
print(f'\nMax aggregation:')
for word_id, result in aggregated_max.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')

Output (illustrative; exact class indices, labels, and scores depend on the checkpoint and library versions)
Tokens: ['[CLS]', 'John', 'Smith', 'works', 'at', 'Google', 'in', 'California', '.', '[SEP]']
Word IDs: [None, 0, 1, 2, 3, 4, 5, 6, 7, None]

Predictions (token-level): [0, 1, 1, 0, 0, 1, 0, 1, 0, 0]
Prediction labels: ['O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC', 'O', 'O']
Scores shape: torch.Size([10, 9])

Simple aggregation (first subword):
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)

Average aggregation:
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)

Max aggregation:
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)
What just happened?
The code loaded a BERT-based NER model and tokenized a sentence. It generated token-level predictions (one label per token), then used three aggregation strategies to map those predictions back to word-level labels. The `word_ids()` method tracked which token belongs to which word (0=John, 1=Smith, etc.), and each strategy applied a different rule: simple took the first token's label, average combined logits across subword tokens before picking the label, and max selected whichever label had the highest confidence across all subwords for that word. All three strategies produced the same result here because each word was only one token, but they differ when words split into multiple subwords.
Common gotcha
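The most common mistake is forgetting that `word_ids()` returns `None` for special tokens like `[CLS]` and `[SEP]`. If you don't skip these with `if word_id is None`, they end up grouped under a spurious `None` key, or raise a TypeError if you use the word ID to index a list. Also, averaging logits is not the same as averaging probability scores: softmax is nonlinear, so the two can select different labels for the same word. The code above averages logits before applying softmax, whereas the transformers pipeline's `average` strategy averages the per-token scores, so pick one rule and apply it consistently when comparing results.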
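To see why the order of averaging and softmax matters, here is a small sketch with made-up two-class logits for a word split into three subwords. All numbers and helper names are hypothetical; the point is only that one very large logit dominates a logit average but saturates under softmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def mean_columns(rows):
    """Element-wise mean across rows of equal length."""
    return [sum(col) / len(rows) for col in zip(*rows)]

# One subword is extremely sure of class 0; two are fairly sure of class 1.
subword_logits = [
    [10.0, 0.0],
    [0.0, 2.0],
    [0.0, 2.0],
]

# Rule A: average logits, then pick the best class.
label_from_logits = argmax(mean_columns(subword_logits))

# Rule B: softmax each subword, average the probabilities, then pick.
label_from_probs = argmax(mean_columns([softmax(row) for row in subword_logits]))

print(label_from_logits, label_from_probs)  # 0 1
```

Neither rule is wrong, but mixing them across experiments makes results hard to compare.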
Error recovery
KeyError when accessing word_id
Predictions don't match between strategies
word_ids() returns None for all positions
Score is 0 or NaN after aggregation
Experienced dev note
In production NER systems, simple aggregation (first subword) is often faster and just as accurate as the average or max strategies; the difference shows up mainly on rare or ambiguous words. More importantly, if you're using the `pipeline()` from transformers with `aggregation_strategy='first'` (the pipeline's name for first-subword aggregation), it handles all of this internally. Once you're fine-tuning on custom data or need custom post-processing, knowing these mechanics lets you reproduce what the standard pipeline does, which is crucial for debugging evaluation mismatches. Also, watch out for contractions and hyphenated words: word boundaries depend on the tokenizer, and behavior varies by model, so always print `word_ids()` first when moving to a new dataset.
Check your understanding
A sentence contains a word that tokenizes into 4 subword tokens, each with logits [0.1, 0.8, 0.05] (scores for classes O, B-PER, I-PER). Using the average strategy, what label would that word get? And how could the simple strategy give a different answer if the first subword token predicted a different class?
Answer hint
A correct answer explains that average strategy stacks logits across all 4 tokens, averages them, then applies argmax to pick B-PER. Simple strategy would pick the label of the first subword only, so if the first subword predicted O instead, simple would return O while average could still return B-PER depending on the other three tokens' logits.
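The arithmetic behind this hint can be checked in a few lines. The sketch assumes a variant of the exercise in which the first subword's logits favour O; those first-row values are hypothetical and chosen only so that the two strategies disagree.

```python
LABELS = ["O", "B-PER", "I-PER"]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

token_logits = [
    [2.0, 0.1, 0.05],  # hypothetical first subword that favours O
    [0.1, 0.8, 0.05],
    [0.1, 0.8, 0.05],
    [0.1, 0.8, 0.05],
]

# Average strategy: mean logit per class across the four subwords.
avg_logits = [sum(col) / len(token_logits) for col in zip(*token_logits)]
average_label = LABELS[argmax(avg_logits)]

# First-subword strategy: only the first row is consulted.
first_label = LABELS[argmax(token_logits[0])]

print(avg_logits)
print(average_label, first_label)
```

The averaged B-PER logit (about 0.625) narrowly beats the averaged O logit (about 0.575), so the average strategy returns B-PER while the first-subword rule returns O.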