Multi-label classification
Why this matters
Real-world text classification rarely fits into a single bucket: a news article can be both 'politics' and 'technology,' and a movie can be both 'drama' and 'thriller.' Production systems need multi-label classification to capture this overlap instead of forcing every input into one category.
Explanation
Multi-label classification is a task where a single input can be assigned zero, one, or many labels from a predefined set. Unlike multi-class classification (exactly one label per instance), multi-label allows overlapping predictions. A movie might be tagged as both 'action' and 'sci-fi'; a social media post might be flagged as 'spam', as 'offensive', as both, or as neither. The key mechanical difference: instead of a softmax over all labels (which forces the probabilities to sum to 1), a multi-label model applies a sigmoid to each label's logit independently and is trained with binary cross-entropy. Each label gets its own probability (0 to 1), and a threshold (typically 0.5) determines whether it is predicted as positive. Transformers handle this through the sequence: tokenize input → pass through encoder → apply classification head with sigmoid activation → threshold predictions. This is the right approach when labels are not mutually exclusive (the presence of one doesn't rule out another) and you need to capture multiple aspects of the same example.
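To make the softmax versus sigmoid difference concrete, here is a minimal pure-Python sketch that applies both to the same logits. The logit values are made up for illustration and do not come from any model.

```python
import math

def softmax(logits):
    """Softmax over all labels: outputs compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Sigmoid on one logit: an independent probability for that label."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for one input over four labels
logits = [2.0, 1.5, -1.0, -3.0]

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(x) for x in logits]

threshold = 0.5
softmax_prediction = [i for i, p in enumerate(softmax_probs) if p > threshold]
multi_label_prediction = [i for i, p in enumerate(sigmoid_probs) if p > threshold]

print([round(p, 3) for p in softmax_probs])
print([round(p, 3) for p in sigmoid_probs])
print(softmax_prediction, multi_label_prediction)
```

With these logits, the first two labels both score well, but softmax lets only one of them clear 0.5 because they share a single unit of probability. Sigmoid scores each label on its own, so both pass.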
Analogy
Think of tagging a photo on social media. A single photo can have '#beach', '#sunset', and '#friends' at the same time, and none of these excludes the others. The system doesn't ask 'what is the primary tag?'; it asks, for each of 100 possible tags independently, 'does this tag apply?' That's multi-label thinking.
Code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# problem_type selects BCEWithLogitsLoss when labels are passed during training.
# Note: this base checkpoint has no trained classification head, so fine-tune
# it on labeled data before trusting its predictions.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4,
    problem_type='multi_label_classification',
    device_map='auto',  # requires the accelerate package
    torch_dtype=torch.float32
)

label_names = ['sports', 'politics', 'technology', 'entertainment']
test_texts = [
    'The tech company announced a new product today.',
    'The president met with world leaders at the summit.',
    'The basketball team won the championship game.'
]

# Tokenize the batch and move the tensors to the same device as the model
inputs = tokenizer(
    test_texts,
    padding=True,
    truncation=True,
    return_tensors='pt'
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Sigmoid gives one independent probability per label (rows do not sum to 1)
probabilities = torch.sigmoid(logits).cpu().numpy()

threshold = 0.5
for i, text in enumerate(test_texts):
    # Keep every label whose probability clears the threshold
    predicted_labels = [
        name for name, p in zip(label_names, probabilities[i])
        if p > threshold
    ]
    print(f'Text: {text}')
    print(f'Predicted labels: {predicted_labels}')
    print(f'Probabilities: {dict(zip(label_names, probabilities[i].round(3)))}')
    print()

Example output (illustrative; assumes the classification head has been fine-tuned on these four labels, since an untrained head gives probabilities near 0.5):
Text: The tech company announced a new product today.
Predicted labels: ['technology']
Probabilities: {'sports': 0.106, 'politics': 0.089, 'technology': 0.754, 'entertainment': 0.123}
Text: The president met with world leaders at the summit.
Predicted labels: ['politics']
Probabilities: {'sports': 0.112, 'politics': 0.642, 'technology': 0.095, 'entertainment': 0.078}
Text: The basketball team won the championship game.
Predicted labels: ['sports']
Probabilities: {'sports': 0.821, 'politics': 0.103, 'technology': 0.088, 'entertainment': 0.156}
What just happened?
The code loaded a DistilBERT model configured for multi-label classification with 4 possible labels. It tokenized three text samples, ran them through the model to get logits, applied sigmoid activation to convert the logits into independent per-label probabilities (each in the 0-1 range), then thresholded each probability at 0.5 to decide which labels to predict. Each text received its own set of four probabilities, one per label; unlike a softmax distribution, they do not have to sum to 1. Labels above the threshold were collected as predictions.
Common gotcha
The most common mistake is thresholding raw logits or applying softmax instead of sigmoid. Softmax forces the probabilities to sum to 1 across all labels, which makes labels compete with each other and breaks the per-label independence that multi-label classification relies on. Always apply sigmoid(logits) to get an independent probability per label. The threshold also matters a great deal: 0.5 is only a default, and for imbalanced datasets or production systems with asymmetric costs (for example, when missing a label is worse than predicting a wrong one), you will need to tune it per label.
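One way to tune thresholds per label is to sweep a few candidate values on held-out validation data and keep the one with the best F1 for each label. The sketch below is a minimal pure-Python version of that idea. The helper names (f1_score, tune_thresholds), the candidate grid, and the validation data are all hypothetical, and F1 is only one possible selection metric.

```python
def f1_score(y_true, y_pred):
    """F1 for one label from binary targets and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_thresholds(val_probs, val_targets, candidates):
    """For each label, pick the candidate threshold with the best validation F1."""
    num_labels = len(val_probs[0])
    best = []
    for j in range(num_labels):
        y_true = [row[j] for row in val_targets]
        scores = []
        for t in candidates:
            y_pred = [1 if row[j] > t else 0 for row in val_probs]
            scores.append((f1_score(y_true, y_pred), t))
        # max() compares (score, threshold) tuples, so ties go to the higher threshold
        best.append(max(scores)[1])
    return best

# Hypothetical validation set: sigmoid outputs and multi-hot targets for 2 labels
val_probs = [[0.9, 0.35], [0.7, 0.40], [0.6, 0.10], [0.2, 0.45], [0.1, 0.05]]
val_targets = [[1, 1], [1, 1], [0, 0], [0, 1], [0, 0]]

thresholds = tune_thresholds(val_probs, val_targets, candidates=[0.3, 0.5, 0.65])
print(thresholds)  # [0.65, 0.3]
```

In this toy data the second label is systematically under-scored, so a 0.5 cutoff would miss every positive for it. A per-label threshold of 0.3 recovers all of them.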
Error recovery
ValueError: problem_type must be one of...
RuntimeError: expected scalar type Half but found Float
sigmoid is not defined
All probabilities are near 0 or 1
Experienced dev note
Older workflows had you build the multi-label head and handle the sigmoid loss yourself. Setting problem_type='multi_label_classification' automates loss computation during training: when you pass labels, the model uses BCEWithLogitsLoss instead of cross-entropy. The option is also available in transformers 4.x releases, so it is not specific to 5.x. This saves you from accidentally shipping a model trained with the wrong loss. Also: threshold selection is a deployment decision, not a training decision. Training doesn't pick a threshold; it is a hyperparameter you tune on validation data. Different use cases need different thresholds: 'show related tags' might use 0.3, while 'filter spam tags' might need 0.8. Version 5.5+ pipelines don't apply a multi-label threshold for you, so apply sigmoid and the threshold yourself to stay in control.
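To show what BCEWithLogitsLoss computes, here is a minimal pure-Python sketch of binary cross-entropy on raw logits for a single example, averaged over its labels. The function name bce_with_logits and the numbers are hypothetical. In practice torch.nn.BCEWithLogitsLoss does this for you, uses mean reduction by default, and expects the targets as float multi-hot vectors with the same shape as the logits.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# Hypothetical logits for one example over three labels
logits = [2.0, -1.0, 0.0]

# Multi-hot float targets: labels 0 and 2 apply, label 1 does not
targets = [1.0, 0.0, 1.0]

loss = bce_with_logits(logits, targets)

# Flipping every target makes the same logits mostly wrong, so the loss grows
flipped_loss = bce_with_logits(logits, [0.0, 1.0, 0.0])
print(round(loss, 4), round(flipped_loss, 4))
```

Each label contributes its own term to the sum. A confident wrong score on one label is penalized no matter how the other labels score, which is exactly the independence that softmax cross-entropy lacks.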
Check your understanding
If you change the threshold from 0.5 to 0.3, will your model output more labels per text, or fewer? Why? And what would happen if you also changed num_labels from 4 to 20?
Answer hint
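A lower threshold is more permissive: every label that cleared 0.5 still clears 0.3, and others may join, so you get the same number of labels or more, never fewer. The threshold and num_labels are independent settings: num_labels controls the output dimensionality, while the threshold controls which of those probabilities are reported as predictions. Moving from 4 to 20 labels means 20 independent yes/no decisions per text, and the classification head has to be retrained for the new label set.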
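The threshold effect can be checked with a few lines of Python. This sketch reuses the probabilities printed for the basketball example in the output above; the predict helper is a hypothetical name introduced here.

```python
label_names = ['sports', 'politics', 'technology', 'entertainment']

# Probabilities copied from the third example in the output above
probs = [0.821, 0.103, 0.088, 0.156]

def predict(probs, threshold):
    """Return every label whose probability is above the threshold."""
    return [name for name, p in zip(label_names, probs) if p > threshold]

at_050 = predict(probs, 0.50)
at_030 = predict(probs, 0.30)
at_015 = predict(probs, 0.15)

print(at_050)  # ['sports']
print(at_030)  # ['sports']
print(at_015)  # ['sports', 'entertainment']
```

Going from 0.5 to 0.3 changes nothing for this text because no probability falls between the two cutoffs. Dropping to 0.15 adds 'entertainment'. Lowering the threshold can only keep or grow the predicted set.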
Version note: problem_type='multi_label_classification' is supported, and in 5.5.x it is the standard pattern. Always load the model with AutoModelForSequenceClassification rather than importing helpers directly from transformers.models.auto.modeling_auto.