Code Intermediate medium · 6 min

Aggregation strategies: group entities

What you will learn

Combine subword token predictions back into word-level or span-level entity tags when working with tokenized text.

Why this matters

Transformer models tokenize text into subwords, but real-world NER pipelines need entity boundaries at the word level: without aggregation, you'll get fragmented or duplicate tags that break downstream processing.

Skip if: your downstream task operates directly on tokens (for example, attention visualization), or you're using a `pipeline()` call with an `aggregation_strategy` set, which handles this internally. In either case you don't need manual aggregation.

Explanation

When you tokenize text with a fast tokenizer, one word often splits into multiple subword tokens (e.g., 'running' → ['run', '##ning']). A token-level classifier outputs predictions for each token independently, creating a mismatch: you get labels for 'run' and '##ning' separately, but your application needs a single 'running' label.

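Aggregation strategies solve this by grouping subword predictions back into meaningful units. The main word-level strategies are first (take the label of the first subword token), average (average the scores across subwords, then pick the top label), and max (take the label of the subword with the highest confidence). In the transformers `pipeline()`, these correspond to `aggregation_strategy='first'`, `'average'`, and `'max'`. The pipeline's `'simple'` strategy is different: it only merges adjacent tokens that carry the same entity tag and does not resolve disagreements inside a word. Mechanically, you track the original word boundaries using the `word_ids()` method on the fast tokenizer's output, which maps each token back to its source word, then apply your chosen rule to merge predictions.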
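The grouping step can be sketched without loading a model. The snippet below is a minimal illustration, assuming hand-written `word_ids` and made-up per-token probabilities; the `aggregate` helper, the label set, and the example tokens are hypothetical and not part of transformers. It applies the first-subword, average, and max rules to a word that splits into two subwords.

```python
from collections import defaultdict

LABELS = ['O', 'B-PER', 'I-PER']  # hypothetical label set for illustration

# Hypothetical tokens: [CLS], 'Wash', '##ington', 'spoke', [SEP]
word_ids = [None, 0, 0, 1, None]

# Made-up per-token probabilities over LABELS (each row sums to 1)
probs = [
    [0.90, 0.05, 0.05],  # [CLS]
    [0.30, 0.60, 0.10],  # 'Wash'
    [0.70, 0.05, 0.25],  # '##ington'
    [0.95, 0.03, 0.02],  # 'spoke'
    [0.90, 0.05, 0.05],  # [SEP]
]


def aggregate(word_ids, probs, strategy):
    """Return {word_id: (label, score)} using the chosen merge rule."""
    groups = defaultdict(list)
    for wid, row in zip(word_ids, probs):
        if wid is None:  # skip special tokens such as [CLS] and [SEP]
            continue
        groups[wid].append(row)

    result = {}
    for wid, rows in groups.items():
        if strategy == 'first':
            merged = rows[0]  # distribution of the first subword
        elif strategy == 'average':
            merged = [sum(col) / len(rows) for col in zip(*rows)]
        elif strategy == 'max':
            merged = max(rows, key=max)  # most confident subword
        else:
            raise ValueError(f'unknown strategy: {strategy}')
        best = max(range(len(merged)), key=merged.__getitem__)
        result[wid] = (LABELS[best], round(merged[best], 4))
    return result


for strategy in ('first', 'average', 'max'):
    print(strategy, aggregate(word_ids, probs, strategy))
```

With these made-up numbers, the first-subword rule labels word 0 as B-PER, while average and max both pick O, because the second subword is confident about a different class. Word 1 has a single token, so every rule agrees on it.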

Use this when you're building a custom NER pipeline that operates at the token level but need word-level outputs, or when you're fine-tuning a model and need to evaluate on standard NER metrics that expect non-overlapping entity spans.

Analogy

Think of it like transcription cleanup: a speech-to-text model scores each fragment it hears, so 'running' might arrive as two pieces with two confidence scores. Aggregation is deciding: do I take the first piece's prediction (first), average both (average), or trust whichever piece is most confident (max)?

Code

python

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'dbmdz/bert-base-cased-finetuned-conll03-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
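# Load with default settings (float32) so the softmax scores keep full precision.
# device_map='auto' would additionally need the accelerate package, and bfloat16
# cannot represent scores to four decimal places.
model = AutoModelForTokenClassification.from_pretrained(model_name)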

id2label = model.config.id2label

text = 'John Smith works at Google in California.'

encodings = tokenizer(
    text,
    truncation=True,
    return_tensors='pt',
    return_offsets_mapping=True
)

word_ids = encodings.word_ids()
print(f'Tokens: {tokenizer.convert_ids_to_tokens(encodings["input_ids"][0])}')
print(f'Word IDs: {word_ids}')

with torch.no_grad():
    outputs = model(**{k: v for k, v in encodings.items() if k != 'offset_mapping'})
    logits = outputs.logits[0]

predictions = torch.argmax(logits, dim=-1)
scores = torch.softmax(logits, dim=-1)

print(f'\nPredictions (token-level): {predictions.tolist()}')
print(f'Prediction labels: {[id2label[p.item()] for p in predictions]}')
print(f'Scores shape: {scores.shape}')

# Aggregation: first-subword strategy (named 'first' in pipeline();
# the pipeline's 'simple' only merges adjacent tokens with the same tag)
aggregated_simple = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    if word_id not in aggregated_simple:
        aggregated_simple[word_id] = {
            'label': id2label[predictions[token_idx].item()],
            'score': scores[token_idx].max().item()
        }

# Aggregation: Average strategy
aggregated_avg = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    if word_id not in aggregated_avg:
        aggregated_avg[word_id] = {'logits': [], 'scores': []}
    aggregated_avg[word_id]['logits'].append(logits[token_idx].detach().cpu())
    aggregated_avg[word_id]['scores'].append(scores[token_idx].detach().cpu())

for word_id in aggregated_avg:
    avg_logits = torch.stack(aggregated_avg[word_id]['logits']).mean(dim=0)
    avg_label = torch.argmax(avg_logits).item()
    # Report the confidence of the averaged distribution, not the first subword's score
    avg_score = torch.softmax(avg_logits.float(), dim=-1)[avg_label].item()
    aggregated_avg[word_id] = {
        'label': id2label[avg_label],
        'score': avg_score
    }

# Aggregation: Max strategy
aggregated_max = {}
for token_idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    max_score, max_score_idx = scores[token_idx].max(dim=0)
    if word_id not in aggregated_max:
        aggregated_max[word_id] = {
            'label': id2label[max_score_idx.item()],
            'score': max_score.item()
        }
    else:
        if max_score.item() > aggregated_max[word_id]['score']:
            aggregated_max[word_id] = {
                'label': id2label[max_score_idx.item()],
                'score': max_score.item()
            }

print(f'\nSimple aggregation (first subword):')
for word_id, result in aggregated_simple.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')

print(f'\nAverage aggregation:')
for word_id, result in aggregated_avg.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')

print(f'\nMax aggregation:')
for word_id, result in aggregated_max.items():
    print(f'  Word {word_id}: {result["label"]} (score: {result["score"]:.4f})')

Output

Tokens: ['[CLS]', 'John', 'Smith', 'works', 'at', 'Google', 'in', 'California', '.', '[SEP]']
Word IDs: [None, 0, 1, 2, 3, 4, 5, 6, 7, None]

Predictions (token-level): [0, 3, 4, 0, 0, 5, 0, 7, 0, 0]
Prediction labels: ['O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC', 'O', 'O']
Scores shape: torch.Size([10, 9])

Simple aggregation (first subword):
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)

Average aggregation:
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)

Max aggregation:
  Word 0: B-PER (score: 0.9987)
  Word 1: I-PER (score: 0.9983)
  Word 2: O (score: 0.9995)
  Word 3: O (score: 0.9988)
  Word 4: B-ORG (score: 0.9991)
  Word 5: O (score: 0.9993)
  Word 6: B-LOC (score: 0.9989)
  Word 7: O (score: 0.9994)

What just happened?

The code loaded a BERT-based NER model and tokenized a sentence. It generated token-level predictions (one label per token), then used three aggregation strategies to map those predictions back to word-level labels. The `word_ids()` method tracked which token belongs to which word (0=John, 1=Smith, etc.), and each strategy applied a different rule: the first-subword rule (printed as "Simple aggregation") took the first token's label, average combined logits across subword tokens before picking the label, and max selected the label of the most confident subword for that word. All three strategies produced the same result here because each word was only one token; they can differ only when a word splits into multiple subwords.
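The word-level labels printed above are still B-/I- tags, one per word. A possible next step is merging them into entity spans. The sketch below assumes the words and labels from the output above; `group_entities` is a hypothetical helper, and treating a stray I- tag as the start of a new span is a design assumption here, not something the library prescribes.

```python
def group_entities(words, labels):
    """Merge consecutive B-/I- word labels of the same type into spans."""
    spans = []
    current = None  # the span currently being extended, if any
    for word, label in zip(words, labels):
        if label == 'O':
            current = None  # an O tag closes any open span
            continue
        prefix, _, ent_type = label.partition('-')
        if prefix == 'I' and current is not None and current['type'] == ent_type:
            current['words'].append(word)  # continue the open span
        else:
            # A B- tag, a type change, or a stray I- tag starts a new span
            current = {'type': ent_type, 'words': [word]}
            spans.append(current)
    return [(span['type'], ' '.join(span['words'])) for span in spans]


words = ['John', 'Smith', 'works', 'at', 'Google', 'in', 'California', '.']
labels = ['B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'O', 'B-LOC', 'O']
print(group_entities(words, labels))
# [('PER', 'John Smith'), ('ORG', 'Google'), ('LOC', 'California')]
```

Joining words with a space is a simplification; if you need exact character spans, the offsets returned by `return_offsets_mapping=True` are the more reliable source.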

Common gotcha

The most common mistake is forgetting that `word_ids()` returns `None` for special tokens like `[CLS]` and `[SEP]`: if you don't skip these with `if word_id is None`, the special tokens are silently grouped under a `None` key (or raise a `TypeError` if you index a list by word ID). Also, averaging logits is not the same as averaging probability scores, and the two can pick different labels. Neither is wrong, but choose one and apply it consistently. The transformers pipeline's `average` strategy averages the softmax scores, so do the same if you need to match its output.
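To see that averaging logits and averaging probabilities can disagree, here is a small worked example in plain Python. The logits are made up for illustration (one word split into three subwords, two classes), and the helper functions are hypothetical stand-ins for the corresponding tensor operations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_rows(rows):
    """Element-wise mean across equally sized rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits: the first subword is extremely confident in class 0,
# the other two moderately prefer class 1.
token_logits = [[10.0, 0.0], [0.0, 2.0], [0.0, 2.0]]

label_from_logits = argmax(softmax(mean_rows(token_logits)))
label_from_probs = argmax(mean_rows([softmax(row) for row in token_logits]))

print(label_from_logits, label_from_probs)  # 0 1
```

Averaging logits lets one very large logit dominate, while averaging probabilities caps each subword's influence at 1.0, so two moderately confident subwords can outvote it.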

Error recovery

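Unexpected None key or TypeError when using word_id

This happens because `word_ids()` returns `None` for special tokens. `None` is a valid dictionary key, so a dict silently collects `[CLS]` and `[SEP]` under it, while indexing a list or tensor with it fails. Always check `if word_id is None: continue` before using word_id as a key or index.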

Predictions don't match between strategies

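Some disagreement is expected: on multi-token words, the first-subword, average, and max strategies can legitimately pick different labels. If your average results differ from a reference implementation, check whether you are averaging logits or probabilities, since the two can give different labels. Pick one approach and use it for both the label and the reported score.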

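word_ids() raises an error or maps the wrong sentence

`word_ids()` only exists on encodings produced by a fast tokenizer; with a slow (pure-Python) tokenizer it raises a `ValueError`. Load the tokenizer with `AutoTokenizer.from_pretrained()` rather than instantiating `transformers.BertTokenizer` directly. When you encode a batch, `word_ids()` with no argument describes only the first item; pass `batch_index` (for example `word_ids(batch_index=1)`) to get the mapping for another sentence.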

Score is 0 or NaN after aggregation

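You likely reduced over the wrong dimension. Apply softmax over the class dimension (`dim=-1`) of the logits. For a single token's row, `scores[token_idx]`, the class dimension is `dim=0`; for the full `[tokens, classes]` matrix it is `dim=-1`, and `dim=0` would instead reduce across tokens.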

Experienced dev note

In production NER systems, first-subword aggregation is the cheapest option and is often about as accurate as the average or max strategies: the difference shows up mainly on rare or ambiguous words that split into several pieces. More importantly, the transformers `pipeline()` handles all of this internally through its `aggregation_strategy` argument (for example `'first'`, `'average'`, or `'max'`), but once you're fine-tuning on custom data or need custom post-processing, knowing these mechanics lets you match exactly what the standard pipeline would do, which is crucial for debugging evaluation mismatches. Also, watch out for contractions and hyphenated words: word boundaries come from the tokenizer's pre-tokenization step, which varies by model, so always print `word_ids()` first when moving to a new dataset.

Check your understanding

A sentence contains a word that tokenizes into 4 subword tokens, each with logits [0.1, 0.8, 0.05] (scores for classes O, B-PER, I-PER). Using the average strategy, what label would that word get, and why might the first-subword strategy give a different answer if one of the subword tokens predicted a different class?

Answer hint

A correct answer explains that the average strategy stacks logits across all 4 tokens, averages them (if all four share the logits [0.1, 0.8, 0.05], the mean is unchanged), then applies argmax to pick B-PER. The first-subword strategy uses only the first subword's label, so if the first subword predicted O instead, it would return O while average could still return B-PER depending on the other three tokens' logits.

VERSION `word_ids()` is only available on encodings produced by fast (Rust-backed) tokenizers; slow (pure-Python) tokenizers do not support it. `AutoTokenizer.from_pretrained()` returns a fast tokenizer by default whenever one exists for the checkpoint, so `word_ids()` is normally available. If you need a slow tokenizer for compatibility, you must pass `use_fast=False` explicitly, and `word_ids()` will then fail.

Once you can aggregate token-level predictions, the natural next step is learning how to evaluate your aggregated predictions against gold-standard NER labels using seqeval metrics.
