Code Intermediate medium · 6 min

Multi-label classification

What you will learn

Train and use transformer models to assign multiple non-exclusive labels to a single input text.

Why this matters

Real-world text classification rarely fits into a single bucket: a news article can be both 'politics' and 'technology,' a movie review is simultaneously 'drama' and 'thriller.' Multi-label classification is essential for production systems that need to capture this complexity without forcing artificial single-category constraints.

Skip if: Don't use multi-label classification if your problem is truly single-choice (exactly one correct answer per instance). Use single-label classification instead: it trains with a plain softmax cross-entropy loss and needs no threshold tuning. Also think twice when the label set is extremely large (more than about 1,000 labels): a simple per-label sigmoid head becomes expensive to train and evaluate, and rare labels see very few positive examples.

Explanation

Multi-label classification is a task where a single input can be assigned zero, one, or many labels from a predefined set. Unlike multi-class classification (exactly one label per instance), multi-label allows overlapping predictions. A movie might be tagged as both 'action' and 'sci-fi'; a social media post might be flagged as 'spam' and 'offensive', or as neither.

The key mechanical difference: instead of a softmax over all labels trained with cross-entropy (which forces the probabilities to sum to 1), multi-label applies a sigmoid to each label's logit independently and trains with binary cross-entropy. Each label gets its own probability between 0 and 1, and a threshold (typically 0.5) determines whether that label is predicted as positive. With transformers the sequence is: tokenize input → pass through encoder → apply classification head with sigmoid activation → threshold predictions. This is the right approach when labels are not mutually exclusive (the presence of one doesn't rule out another) and you need to capture multiple aspects of the same example.
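To make the sigmoid-versus-softmax difference concrete, here is a minimal sketch in plain Python. The logits are hypothetical values chosen for illustration, not output from a real model, and the two helper functions are hand-written stand-ins for what torch.softmax and torch.sigmoid compute.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that compete and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Turn each logit into its own probability, independent of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical logits for one text that is about both politics and technology.
label_names = ["sports", "politics", "technology", "entertainment"]
logits = [-2.0, 1.5, 1.2, -1.0]

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)

threshold = 0.5
softmax_labels = [n for n, p in zip(label_names, softmax_probs) if p > threshold]
sigmoid_labels = [n for n, p in zip(label_names, sigmoid_probs) if p > threshold]

print("softmax:", [round(p, 3) for p in softmax_probs], softmax_labels)
print("sigmoid:", [round(p, 3) for p in sigmoid_probs], sigmoid_labels)
```

With these logits, softmax lets only 'politics' cross the threshold because the two strong labels split the probability mass, while sigmoid keeps both 'politics' and 'technology'.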

Analogy

Think of tagging a photo on social media. A single photo can have '#beach', '#sunset', and '#friends' simultaneously; none of these excludes the others. The system doesn't ask 'what is the primary tag?'. For each of 100 possible tags, it asks independently: 'does this tag apply?' That's multi-label thinking.

Code

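Illustrative only: no API key is needed, but the first run downloads distilbert-base-uncased from the Hugging Face Hub. The classification head is newly initialized, so until the model is fine-tuned its probabilities hover around 0.5 and will not match the output shown below.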

python

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# problem_type tells the model to use BCEWithLogitsLoss when labels are
# passed during training. The classification head starts out untrained.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    problem_type='multi_label_classification'
)
model.eval()  # disable dropout for inference

label_names = ['sports', 'politics', 'technology', 'entertainment']

test_texts = [
    'The tech company announced a new product today.',
    'The president met with world leaders at the summit.',
    'The basketball team won the championship game.'
]

# Pad to the longest text in the batch and return PyTorch tensors.
inputs = tokenizer(
    test_texts,
    padding=True,
    truncation=True,
    return_tensors='pt'
)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (batch_size, num_labels)
    # Sigmoid gives one independent probability per label.
    probabilities = torch.sigmoid(logits).cpu().numpy()

threshold = 0.5
for text, probs in zip(test_texts, probabilities):
    # Keep every label whose probability clears the threshold.
    predicted_labels = [
        name for name, p in zip(label_names, probs) if p > threshold
    ]
    rounded = {name: round(float(p), 3) for name, p in zip(label_names, probs)}
    print(f'Text: {text}')
    print(f'Predicted labels: {predicted_labels}')
    print(f'Probabilities: {rounded}')
    print()

Output

Text: The tech company announced a new product today.
Predicted labels: ['technology']
Probabilities: {'sports': 0.106, 'politics': 0.089, 'technology': 0.754, 'entertainment': 0.123}

Text: The president met with world leaders at the summit.
Predicted labels: ['politics']
Probabilities: {'sports': 0.112, 'politics': 0.642, 'technology': 0.095, 'entertainment': 0.078}

Text: The basketball team won the championship game.
Predicted labels: ['sports']
Probabilities: {'sports': 0.821, 'politics': 0.103, 'technology': 0.088, 'entertainment': 0.156}

What just happened?

The code loaded a DistilBERT model configured for multi-label classification with 4 possible labels. It tokenized three text samples, ran them through the model to get logits, applied sigmoid activation to convert the logits into independent per-label probabilities (0-1 range), then thresholded each probability at 0.5 to decide which labels to predict. Each text received its own set of 4 probabilities, one per label. Unlike a softmax distribution, these do not need to sum to 1, and every label above the threshold was collected as a prediction.

Common gotcha

The most common mistake is thresholding raw logits or softmax outputs instead of sigmoid outputs. Softmax forces probabilities to sum to 1 across all labels, which makes the labels compete with each other and defeats the point of multi-label classification. Always apply sigmoid(logits) to get an independent probability per label. The threshold also matters a great deal: 0.5 is only a default. For imbalanced datasets, or production systems with asymmetric costs (for example, missing a label is worse than predicting an extra one), you'll need to tune it per label.
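Per-label threshold tuning can be sketched as a small grid search over validation data. Everything below is illustrative: the label names, validation labels, and probabilities are made up, best_threshold is a hypothetical helper, and F1 is just one possible selection metric. If missing a label is costlier than a false alarm, a recall-weighted metric would be a better choice.

```python
def f1_score(y_true, y_pred):
    """F1 for one label, given 0/1 ground truth and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs, candidates):
    """Return the first candidate threshold with the highest F1."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if p > t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical validation set: rows are examples, columns are labels.
label_names = ["spam", "offensive"]
val_true = [[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
val_probs = [[0.45, 0.10], [0.40, 0.90], [0.20, 0.60],
             [0.35, 0.30], [0.10, 0.85], [0.05, 0.55]]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

thresholds = {}
for j, name in enumerate(label_names):
    column_true = [row[j] for row in val_true]
    column_probs = [row[j] for row in val_probs]
    thresholds[name], _ = best_threshold(column_true, column_probs, candidates)

print(thresholds)
```

In this toy data the model is under-confident on 'spam' and over-confident on 'offensive', so the search picks a low threshold for the first and a high one for the second. A single global 0.5 would miss every spam example.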

Error recovery

ValueError: problem_type must be one of...

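This is raised when <code>problem_type</code> is set to a value the config does not recognize, for example a typo. The accepted values are <code>'regression'</code>, <code>'single_label_classification'</code>, and <code>'multi_label_classification'</code>. Note that omitting the parameter does not raise an error: the model then infers the loss from <code>num_labels</code> and the label dtype at training time, which can silently select cross-entropy. Always set <code>problem_type='multi_label_classification'</code> explicitly.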

RuntimeError: expected scalar type Half but found Float

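The model weights are in float16 or bfloat16 while another tensor in the computation (often the float labels) is float32. Don't cast the tokenizer outputs: <code>input_ids</code> and <code>attention_mask</code> must stay integer tensors. Either load the model in full precision with <code>torch_dtype=torch.float32</code>, or cast only the floating-point tensors to the model's dtype, for example <code>labels = labels.to(model.dtype)</code>.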

sigmoid is not defined

You called <code>sigmoid</code> without importing it. Add <code>from torch.nn.functional import sigmoid</code> at the top, or call <code>torch.sigmoid(logits)</code> directly.

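All probabilities are near 0.5

The model isn't trained on your data: the encoder is pretrained, but the classification head starts with random weights, so its logits sit close to zero. You'll need to fine-tune it on labeled examples, with multi-hot float label vectors, using a proper training loop. Calling <code>model.train()</code> only switches the model into training mode; it does not train anything by itself.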
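Fine-tuning for multi-label classification expects each example's labels as a float vector with one slot per label (a multi-hot vector). The following is a minimal sketch of that encoding step; to_multi_hot is a hypothetical helper and the annotated examples are invented for illustration.

```python
def to_multi_hot(example_labels, label_names):
    """Encode a list of label names as a float multi-hot vector."""
    index = {name: i for i, name in enumerate(label_names)}
    vector = [0.0] * len(label_names)
    for name in example_labels:
        vector[index[name]] = 1.0  # raises KeyError on an unknown label
    return vector

label_names = ["sports", "politics", "technology", "entertainment"]

# Hypothetical annotations: an example may carry several labels or none.
annotated = [
    ["politics", "technology"],
    [],
    ["sports"],
]

targets = [to_multi_hot(labels, label_names) for labels in annotated]
for labels, target in zip(annotated, targets):
    print(labels, "->", target)
```

The values are floats rather than integers because binary cross-entropy compares them against per-label probabilities. Wrapping targets in torch.tensor(targets) would give a float tensor of shape (batch_size, num_labels).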

Experienced dev note

Setting problem_type='multi_label_classification' automates loss computation during training: when you pass labels, the model uses BCEWithLogitsLoss instead of cross-entropy, so you no longer need to build a custom multi-label head and loss by hand as early transformers releases required. This saves you from accidentally shipping a model trained with the wrong loss. Also: threshold selection is a deployment decision, not a training decision. Training doesn't pick a threshold; it's a hyperparameter you tune on validation data. Different use cases need different thresholds: 'show related tags' might use 0.3, but 'filter spam tags' might need 0.8. Pipelines can return per-label scores, but they don't threshold them for you, so apply sigmoid and the threshold yourself to stay in control.
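For intuition about what binary cross-entropy on logits computes, here is a hand-written sketch in plain Python. It is a simplified stand-in for a single example, not the library's implementation: bce_with_logits is a hypothetical helper, the logits are made up, and the result is averaged over labels, which mirrors the default mean reduction of torch.nn.BCEWithLogitsLoss.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    losses = []
    for x, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        losses.append(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)

# Hypothetical logits for one example with two labels.
targets = [1.0, 0.0]              # first label applies, second does not
good_logits = [3.0, -3.0]         # confident and correct on both labels
unsure_logits = [0.0, 0.0]        # sigmoid gives 0.5 for both labels
bad_logits = [-3.0, 3.0]          # confident and wrong on both labels

good_loss = bce_with_logits(good_logits, targets)
unsure_loss = bce_with_logits(unsure_logits, targets)
bad_loss = bce_with_logits(bad_logits, targets)

print(round(good_loss, 4), round(unsure_loss, 4), round(bad_loss, 4))
```

Each label contributes its own term, so getting one label right never compensates for getting another wrong. That is the property that separates this loss from softmax cross-entropy.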

Check your understanding

If you change the threshold from 0.5 to 0.3, will your model output more labels per text or fewer? Why? And what would happen if you also changed num_labels from 4 to 20?

Show answer hint

A lower threshold lets more labels through, so predictions become more liberal (higher recall, lower precision). Threshold and num_labels are independent settings: num_labels controls the output dimensionality, while the threshold controls which of those probabilities get reported as predictions. More labels means more independent decision points, and changing num_labels creates a new classification head that has to be trained again.

VERSION Older transformers releases required manual head creation and a hand-written BCE loss. Current 4.x and 5.x releases support problem_type='multi_label_classification', and it is the standard pattern; check the changelog for your installed version if you depend on an old release. Load models through the public AutoModelForSequenceClassification class rather than importing helpers from internal modules such as transformers.models.auto.modeling_auto.

Next, learn how to fine-tune a transformer for multi-label classification on custom data using a training loop with the proper loss function and validation metrics.
