Cheat Sheet intermediate · 12 min read

Text Classification Cheat Sheet: Methods & Implementation

version 2026.04

Categorize text with rules, ML, or neural networks

Mental model

Assign predefined labels to text inputs using learned or rule-based patterns

Like a postal worker sorting mail into labeled bins: you extract identifying features (address, weight, color) and route each piece to the correct destination based on learned sorting rules.

Key Concepts

Single-label classification

Each text belongs to exactly one category; mutually exclusive outputs (e.g., spam/not-spam).

Multi-label classification

Text can belong to multiple categories simultaneously; non-exclusive outputs (e.g., article tagged [politics, breaking-news, opinion]).

Feature extraction

Converting raw text into numerical representations: bag-of-words, TF-IDF, embeddings, or transformer hidden states.

Tokenization

Splitting text into tokens (words, subwords, characters) that models can process.

Class imbalance

Training data has a disproportionate label distribution (e.g., 95% negative, 5% positive), which degrades minority-class recall.

Zero-shot classification

Classifying text into categories the model never saw during training using natural language descriptions of labels.
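To make the tokenization and feature extraction concepts above concrete, here is a minimal pure-Python TF-IDF sketch. The helper names `tokenize` and `tfidf_vectors` and the three sample documents are illustrative assumptions. The smoothed IDF term and L2 normalization follow the formula scikit-learn documents for its default settings, but this is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation (hypothetical helper).
    return [w.strip(".,!?") for w in text.lower().split() if w.strip(".,!?")]

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        # L2-normalize so document length does not dominate.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

docs = ["Free prize inside", "Meeting notes inside", "Free free prize"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
```

In the second document, "meeting" receives a higher weight than "inside" because "inside" also occurs in another document.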

Text Classification Comparison

Approach	Speed	Accuracy	Data Needed	Best For

Text Classification Patterns

01 Scikit-learn TF-IDF + Linear SVM

Fast baseline, interpretable, <10K samples

python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('svm', LinearSVC(C=1.0, max_iter=2000, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = (y_pred == y_test).mean()

output accuracy: 0.87

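LinearSVC does not output probabilities. For confidence scores, use decision_function (uncalibrated margins), wrap the model in CalibratedClassifierCV, or switch to SVC(probability=True), which can be roughly 10x slower.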

02 Hugging Face Fine-tuned Transformer

SOTA accuracy, >500 labeled samples, GPU available

python

from transformers import pipeline
import torch

classifier = pipeline(
    'text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)

results = classifier([
    'This product is amazing!',
    'Terrible experience, never again'
])

for r in results:
    print(f"{r['label']}: {r['score']:.3f}")

output
POSITIVE: 0.998
NEGATIVE: 0.995

The pipeline runs on CPU unless you pass device; set device=0 for the first GPU, or expect throughput on the order of 5-10 samples/sec. DistilBERT is 40% smaller than BERT; a RoBERTa model typically adds 2-3 points of accuracy at higher compute cost.

03 Zero-shot with Transformers Pipeline

No training data, dynamic labels, complex reasoning needed

python

from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

text = 'The new iPhone 16 features a faster processor.'
candidates = ['product launch', 'tech specs', 'price discussion', 'user experience']

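# The pipeline returns a dict with 'sequence', 'labels', and 'scores' (sorted by score)
results = classifier(text, candidates)
for label, score in zip(results['labels'], results['scores']):
    print(f"{label}: {score:.3f}")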

output
tech specs: 0.672
product launch: 0.198
user experience: 0.088

facebook/bart-large-mnli is about 1.6GB and slow on CPU (5-10s per call). Zero-shot scores tend to be overconfident, so do not treat raw scores as calibrated probabilities.

04 OpenAI GPT-4o Multi-label Classification

Few-shot, high complexity, no GPU, reasoning required

python

from openai import OpenAI
import json
import os

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'system',
            'content': 'Classify the article into labels. Return JSON: {"labels": ["label1", "label2"], "confidence": 0.95}'
        },
        {
            'role': 'user',
            'content': 'Apple announced a new M4 chip with 10-core CPU. It supports AI inference 5x faster than M3.'
        }
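    ],
    response_format={'type': 'json_object'},  # JSON mode, so json.loads below receives valid JSON
    temperature=0  # Reduce run-to-run variability
)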

result = json.loads(response.choices[0].message.content)
print(result['labels'])

output ["Apple", "AI", "Hardware", "Product Announcement"]

LLM cost scales with tokens: 50K texts at roughly 1K tokens each and $0.01 per 1K tokens comes to about $500. Use the Batch API for bulk classification, and set temperature=0 to reduce run-to-run variability.

05 PyTorch Custom LSTM Classifier

Custom architecture, domain-specific embeddings, research

python

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        _, (hidden, _) = self.lstm(embedded)
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        return self.fc(hidden)

model = TextClassifier(vocab_size=10000, num_classes=3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

output model ready for training

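LSTMs process tokens sequentially, so they can be up to about 10x slower to train than transformers on long inputs, and they rarely beat a fine-tuned BERT. Use one when you need a small custom architecture or when GPU memory is tight.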

06 Multi-label with Custom Threshold

Multiple labels per text, need to tune recall/precision tradeoff

python

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

mlb = MultiLabelBinarizer()
y_train_binary = mlb.fit_transform(y_train)  # Convert ['a', 'b'] → [1,1,0,0]

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
model.fit(X_train_vectorized, y_train_binary)

y_pred_proba = np.array([est.predict_proba(X_test_vectorized)[:, 1] for est in model.estimators_]).T
y_pred_binary = (y_pred_proba > 0.5).astype(int)  # Threshold at 0.5
y_pred_labels = mlb.inverse_transform(y_pred_binary)

output [['label_a', 'label_c'], ['label_b']]

A 0.5 threshold is arbitrary; use precision-recall curves to pick a threshold per class. Class imbalance hurts minority labels, so consider RandomForestClassifier(class_weight='balanced') or oversampling such as SMOTE.

Common Errors & Fixes

01 RuntimeError: Expected all tensors to be on the same device

Cause: The model is on the GPU but the input tensors are on the CPU (or vice versa). A Transformers pipeline moves inputs to its device for you, but manual model(**inputs) calls do not.

Fix: Explicitly move the model and the input tensors to the same device.

python

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = tokenizer(text, return_tensors='pt').to(device)
outputs = model(**inputs)

02 Low accuracy on held-out test set despite high training accuracy

Cause: Overfitting: model memorized training labels. Common with small datasets (<1K samples) or large models (BERT).

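Fix: Add regularization and use early stopping.

python

from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,  # Regularization
    learning_rate=2e-5,
    save_strategy='steps',
    eval_strategy='steps',
    eval_steps=100,
    save_steps=100,  # Keep checkpoints aligned with evaluations
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop after 3 evaluations without improvement in eval_loss
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()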
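The early-stopping idea can be sketched without any framework: track the best validation loss and stop once it has not improved for a set number of evaluations. The function name `early_stopping_step`, the `patience` and `min_delta` parameters, and the loss values are hypothetical, chosen only to illustrate the rule. This is not how any particular library implements it.

```python
def early_stopping_step(val_losses, patience=3, min_delta=0.0):
    """Return (stop_step, best_step) for a sequence of validation losses."""
    best_loss = float("inf")
    best_step = -1
    for step, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience window.
            best_loss, best_step = loss, step
        elif step - best_step >= patience:
            # No improvement for `patience` evaluations: stop here.
            return step, best_step
    return len(val_losses) - 1, best_step

# Hypothetical validation losses: improving at first, then overfitting.
val_losses = [0.70, 0.55, 0.48, 0.50, 0.53, 0.57, 0.61]
stop_step, best_step = early_stopping_step(val_losses)
print(f"stop at evaluation {stop_step}, restore checkpoint from evaluation {best_step}")
```

With these numbers the loop stops at evaluation 5 and points back to evaluation 2, which is the checkpoint you would restore.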

03 Memory error when fine-tuning BERT: 'CUDA out of memory'

Cause: Full BERT (110M params) plus a batch size too large for GPU VRAM (e.g., a 16GB V100).

Fix: Use DistilBERT (40% smaller, about 97% of BERT's performance), gradient accumulation, or mixed precision.

python

from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
)

training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # Effective batch: 8 x 4 = 32
    fp16=True  # Mixed precision: lowers activation memory (requires a CUDA GPU)
)
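Gradient accumulation trades time for memory: several small batches are processed one after another and their gradients are averaged before a single optimizer step. The toy example below, a one-parameter linear model with synthetic data (all names and values are illustrative assumptions), checks that averaging the gradients of 4 micro-batches of 8 gives the same gradient as one batch of 32 under a mean loss.

```python
def mse_gradient(w, xs, ys):
    # Gradient with respect to w of mean((w * x - y) ** 2) over the batch.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Synthetic data for illustration: y = 3x + 1 on 32 points.
xs = [float(i) for i in range(1, 33)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# One large batch of 32 samples.
full_batch_grad = mse_gradient(w, xs, ys)

# Four micro-batches of 8, each scaled by 1 / accumulation_steps.
micro_batch_size, accumulation_steps = 8, 4
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    xb = xs[start:start + micro_batch_size]
    yb = ys[start:start + micro_batch_size]
    accumulated_grad += mse_gradient(w, xb, yb) / accumulation_steps

effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size, full_batch_grad, accumulated_grad)
```

Peak activation memory is set by the micro-batch of 8, while the update behaves like one computed from 32 samples.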

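04 Unstable predictions across near-identical texts in zero-shot

Cause: In single-label mode, NLI-based zero-shot models such as BART-MNLI apply a softmax over the candidates' entailment scores. If candidates are semantically similar, their scores are close and small wording changes flip the ranking.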

Fix: Use more discriminative label descriptions and ensure they're semantically distant.

python

# Keep candidates as short phrases; the template turns each into a full hypothesis
candidates = [
    'politics and elections',
    'sports and athletics',
    'entertainment and celebrities'
]
results = classifier(text, candidates, hypothesis_template='This article is about {}.')
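The zero-shot pipeline turns each candidate into an NLI hypothesis by substituting it into `hypothesis_template`, so it helps to inspect the resulting sentences before running the model. The helper `build_hypotheses` below is a hypothetical stand-in for that substitution step, not part of the Transformers API.

```python
def build_hypotheses(candidate_labels, hypothesis_template):
    # One hypothesis per candidate, produced by plain string formatting.
    return [hypothesis_template.format(label) for label in candidate_labels]

# Short label phrases read naturally once placed in the template.
good = build_hypotheses(
    ['politics and elections', 'sports and athletics'],
    'This article is about {}.'
)

# Full-sentence labels combined with a template produce a garbled hypothesis.
garbled = build_hypotheses(
    ['This article is about politics and elections'],
    'This text is {}.'
)

for hypothesis in good + garbled:
    print(hypothesis)
```

If a printed hypothesis is not a sentence you would write yourself, shorten the label or adjust the template.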

05 Confusion between single-label and multi-label during training

Cause: Using CrossEntropyLoss (single-label) for multi-label task or BCEWithLogitsLoss (multi-label) for single-label.

Fix: Match the loss to the task type.

python

import torch.nn as nn

# Single-label (one class per sample)
criterion = nn.CrossEntropyLoss()
outputs = model(input_ids)  # Shape: [batch, num_classes]
loss = criterion(outputs, labels)  # labels: [batch] with int class indices

# Multi-label (multiple classes per sample)
criterion = nn.BCEWithLogitsLoss()
outputs = model(input_ids)  # Shape: [batch, num_classes]
loss = criterion(outputs, labels.float())  # labels: [batch, num_classes] with 0/1

Production Gotchas

⚠ Tokenizer mismatch between training and inference

If you fine-tune with one tokenizer but run inference with another, token IDs won't align and prediction quality collapses. Always load the tokenizer from the same checkpoint as the model: `AutoTokenizer.from_pretrained(model_name)`, where model_name matches your model.
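A toy illustration of why mismatched tokenizers break predictions: two vocabularies built from different corpora assign different IDs to the same words. The `build_vocab` and `encode` helpers and both corpora are hypothetical; real tokenizers use subword vocabularies, but the failure mode is the same.

```python
def build_vocab(corpus):
    # Assign IDs in order of first appearance; 0 is reserved for unknown tokens.
    vocab = {"[UNK]": 0}
    for text in corpus:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]

# Vocabulary the (hypothetical) model was trained with.
train_vocab = build_vocab(["great product", "terrible service"])
# A different vocabulary accidentally used at inference time.
other_vocab = build_vocab(["terrible weather", "great game"])

text = "great service"
ids_expected = encode(text, train_vocab)
ids_mismatched = encode(text, other_vocab)
print(ids_expected, ids_mismatched)
```

The model's embedding rows are tied to the training IDs, so the second encoding looks up the wrong vectors for the same text.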

⚠ Class imbalance destroys minority class recall

If 95% of data is 'negative' and 5% is 'positive', a naive classifier that predicts 'negative' for everything gets 95% accuracy while completely missing the minority class. Use stratified train/test splits, weighted loss functions (CrossEntropyLoss(weight=...)), or SMOTE for synthetic oversampling.
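The 95/5 scenario can be reproduced in a few lines. The sketch below uses synthetic labels and hypothetical helper names, and it computes inverse-frequency class weights with the heuristic n_samples / (n_classes * count), which is the formula scikit-learn documents for `class_weight='balanced'`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0

def balanced_class_weights(y):
    # Inverse-frequency weights: rare classes get proportionally larger weights.
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Synthetic data: 95 negatives, 5 positives, and a classifier that always says "negative".
y_true = ["negative"] * 95 + ["positive"] * 5
y_pred = ["negative"] * 100

naive_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, "positive")
weights = balanced_class_weights(y_true)
print(naive_accuracy, minority_recall, weights)
```

The weights could then be passed to a weighted loss, in the class order your model expects.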

⚠ Transformer models output logits, not probabilities

Calling the model returns raw logits, not normalized probabilities. Apply softmax (single-label) or sigmoid (multi-label) to map them to the 0-1 range before thresholding or reporting confidence scores.
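The two conversions look like this in plain Python. The logit values are made up for illustration; in practice you would apply the framework's own softmax or sigmoid to the model's output tensor.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logit):
    # Each label gets an independent probability.
    return 1.0 / (1.0 + math.exp(-logit))

# Single-label: exactly one class wins.
single_label_logits = [2.0, 0.5, -1.0]
probs = softmax(single_label_logits)
predicted_class = probs.index(max(probs))

# Multi-label: every label at or above the threshold is kept.
multi_label_logits = [1.2, -0.3, 0.0]
label_probs = [sigmoid(z) for z in multi_label_logits]
active_labels = [i for i, p in enumerate(label_probs) if p >= 0.5]

print(predicted_class, active_labels)
```

Softmax probabilities compete with each other, while sigmoid outputs do not, which is why multi-label predictions can keep several labels at once.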

⚠ LLM API costs scale linearly with data size

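Classifying 100K documents with GPT-4o costs roughly $1,000, assuming about $0.01 per document. Use a cheaper model (gpt-4o-mini: about 90% accuracy in this sheet's estimate, roughly 10x cheaper), the Batch API (50% discount in exchange for up to 24-hour turnaround), or fine-tune a smaller model once and deploy it locally.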
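A back-of-the-envelope estimator makes the linear scaling explicit. The figures below are assumptions for illustration only (about 1,000 tokens per document and $0.01 per 1K tokens); check current pricing before budgeting. The function name `estimate_cost` is hypothetical.

```python
def estimate_cost(n_texts, tokens_per_text, price_per_1k_tokens, discount=0.0):
    """Estimated API cost in dollars; discount is a fraction such as 0.5 for 50% off."""
    total_tokens = n_texts * tokens_per_text
    return total_tokens / 1000 * price_per_1k_tokens * (1.0 - discount)

# Assumed workload: 100K documents at ~1,000 tokens each, $0.01 per 1K tokens.
full_price = estimate_cost(100_000, 1_000, 0.01)
batch_price = estimate_cost(100_000, 1_000, 0.01, discount=0.5)

print(f"standard: ${full_price:,.0f}, batch: ${batch_price:,.0f}")
```

Doubling the number of documents or the tokens per document doubles the estimate, so shortening prompts helps as much as reducing volume.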

⚠ Threshold 0.5 is almost never optimal for imbalanced classes

Use precision-recall curves (sklearn.metrics.precision_recall_curve) to find the threshold that maximizes your metric (F1, precision, recall). Multi-label tasks often need per-class thresholds.
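As a sketch of threshold tuning, the code below scans candidate thresholds for one class and keeps the one with the best F1. The labels and scores are made-up values, and `f1_at` and `best_threshold` are hypothetical helpers. With scikit-learn available, `precision_recall_curve` provides the candidate thresholds more efficiently.

```python
def f1_at(y_true, scores, threshold):
    # Predict positive when the score reaches the threshold.
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    # Every distinct score is a candidate cut point.
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(y_true, scores, t))

# Imbalanced toy data: 3 positives out of 10, with low predicted scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

threshold = best_threshold(y_true, scores)
f1_default = f1_at(y_true, scores, 0.5)
f1_tuned = f1_at(y_true, scores, threshold)
print(threshold, round(f1_default, 3), round(f1_tuned, 3))
```

For multi-label tasks, run the same search once per label column on a validation set, not on the test set.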

⚠ Train/test label distribution mismatch breaks everything

If training has 80% 'class_a' and test has 40% 'class_a', test accuracy will not reflect what the model learned. A stratified split keeps label proportions the same in both sets: `train_test_split(X, y, test_size=0.2, stratify=y)`.
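What stratification does can be sketched in pure Python: split each class separately, then merge the pieces. The helper `stratified_split` and the synthetic labels are illustrative assumptions, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels: 80% class_a, 20% class_b.
labels = ["class_a"] * 80 + ["class_b"] * 20
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] == "class_a" for i in train_idx) / len(train_idx)
test_share = sum(labels[i] == "class_a" for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Both splits end up with the same share of class_a, so metrics on the test set are comparable to those seen during training.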

Verified 2026-04 · gpt-4o, gpt-4o-mini
