Text Classification Cheat Sheet — Methods & Implementation G
Assign predefined labels to text inputs using learned or rule-based patterns
Like a postal worker sorting mail into labeled bins: you extract identifying features (address, weight, color) and route each piece to the correct destination based on learned sorting rules.
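The rule-based end of that analogy can be sketched in a few lines. This is a minimal illustration, not a recommended production approach: the `RULES` table, its labels, and the `classify` helper are all hypothetical names invented for this example.

```python
import re

# Hypothetical keyword rules: label -> words that signal it.
RULES = {
    "billing": {"invoice", "refund", "charge"},
    "shipping": {"delivery", "package", "tracking"},
}

def classify(text, rules=RULES, default="other"):
    """Return the label whose keywords overlap most with the text."""
    # Feature extraction: lowercase word tokens.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # Routing: count keyword hits per label and pick the largest.
    hits = {label: len(tokens & words) for label, words in rules.items()}
    best = max(hits, key=hits.get)
    # Fall back to the default bin when no rule matched at all.
    return best if hits[best] > 0 else default

print(classify("Where is my package? The tracking page is empty."))  # shipping
print(classify("Hello there"))  # other
```

Learned classifiers replace the hand-written keyword table with weights estimated from labeled examples, but the extract-then-route shape stays the same.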
Key Concepts
Text Classification Comparison
| Approach | Speed | Accuracy | Data Needed | Best For |
|---|---|---|---|---|
Text Classification Patterns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),
    ('svm', LinearSVC(C=1.0, max_iter=2000, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = (y_pred == y_test).mean()
print(f"accuracy: {accuracy:.2f}")
# Output: accuracy: 0.87

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
classifier = pipeline(
    'text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)
results = classifier([
    'This product is amazing!',
    'Terrible experience, never again'
])
for r in results:
    print(f"{r['label']}: {r['score']:.3f}")
# Output:
# POSITIVE: 0.998
# NEGATIVE: 0.995

from transformers import pipeline
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
text = 'The new iPhone 16 features a faster processor.'
candidates = ['product launch', 'tech specs', 'price discussion', 'user experience']
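results = classifier(text, candidates)
# The result is a dict with parallel 'labels' and 'scores' lists, sorted by descending score.
for label, score in zip(results['labels'], results['scores']):
    print(f"{label}: {score:.3f}")
# Output (top three shown):
# tech specs: 0.672
# product launch: 0.198
# user experience: 0.088

from openai import OpenAI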
import json
import os
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
response = client.chat.completions.create(
    model='gpt-4o',
    # JSON mode constrains the reply to a JSON object so json.loads below can parse it.
    response_format={'type': 'json_object'},
    messages=[
        {
            'role': 'system',
            'content': 'Classify the article into labels. Return JSON: {"labels": ["label1", "label2"], "confidence": 0.95}'
        },
        {
            'role': 'user',
            'content': 'Apple announced a new M4 chip with 10-core CPU. It supports AI inference 5x faster than M3.'
        }
    ]
)
result = json.loads(response.choices[0].message.content)
print(result['labels'])
# Output: ['Apple', 'AI', 'Hardware', 'Product Announcement']

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        return self.fc(hidden)

model = TextClassifier(vocab_size=10000, num_classes=3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Model ready for training

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np
mlb = MultiLabelBinarizer()
y_train_binary = mlb.fit_transform(y_train) # Convert ['a', 'b'] → [1,1,0,0]
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100))
model.fit(X_train_vectorized, y_train_binary)
y_pred_proba = np.array([est.predict_proba(X_test_vectorized)[:, 1] for est in model.estimators_]).T
y_pred_binary = (y_pred_proba > 0.5).astype(int) # Threshold at 0.5
y_pred_labels = mlb.inverse_transform(y_pred_binary)  # returns a list of label tuples
# Output: [('label_a', 'label_c'), ('label_b',)]

Common Errors & Fixes
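RuntimeError: Expected all tensors to be on the same device
Cause: The model is on the GPU but the input tensors are on the CPU (or vice versa). Depending on the library version, the Transformers pipeline may pick a device for you when `device` is not set, so do not assume where the model ends up.
Fix: Explicitly move the model and the input tensors to the same device: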
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
inputs = tokenizer(text, return_tensors='pt').to(device)
outputs = model(**inputs)

Low accuracy on held-out test set despite high training accuracy
Cause: Overfitting: the model memorized the training labels. Common with small datasets (<1K samples) or large models (BERT).
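Fix: Add regularization and use early stopping:
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,  # regularization
    learning_rate=2e-5,
    save_strategy='steps',
    eval_strategy='steps',  # named evaluation_strategy in older transformers releases
    eval_steps=100,
    load_best_model_at_end=True  # required by EarlyStoppingCallback; tracks eval loss by default
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Stop once eval loss has not improved for 3 consecutive evaluations (illustrative value)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
trainer.train()

Memory error when fine-tuning BERT: 'CUDA out of memory'
Cause: Full BERT (110M params) plus a batch size too large for the GPU's VRAM (for example, 16 GB on a V100).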
Fix: Use DistilBERT (40% smaller, retains about 97% of BERT's performance), gradient accumulation, or mixed precision:
from transformers import AutoModelForSequenceClassification, TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=2
)
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # Effective batch: 8 x 4 = 32
    fp16=True  # Mixed precision: lowers activation memory; actual savings vary by model
)

Wildly varying predictions across identical texts in zero-shot
Cause: NLI-based zero-shot models such as BART-MNLI apply a softmax over the candidate labels. If the candidates are semantically similar, the probability mass is split almost evenly and the top label becomes unstable.
Fix: Use more discriminative label descriptions and make sure they are semantically distant. Each candidate is substituted into the hypothesis template, so keep candidates as short phrases and put the sentence frame in the template:
candidates = [
    'politics and elections',
    'sports and athletics',
    'entertainment and celebrities'
]
results = classifier(text, candidates, hypothesis_template='This article is about {}.')

Confusion between single-label and multi-label during training
Cause: Using CrossEntropyLoss (single-label) for a multi-label task, or BCEWithLogitsLoss (multi-label) for a single-label task.
Fix: Match the loss to the task type:
# Single-label (one class per sample)
criterion = nn.CrossEntropyLoss()
outputs = model(input_ids) # Shape: [batch, num_classes]
loss = criterion(outputs, labels) # labels: [batch] with int class indices
# Multi-label (multiple classes per sample)
criterion = nn.BCEWithLogitsLoss()
outputs = model(input_ids) # Shape: [batch, num_classes]
loss = criterion(outputs, labels.float())  # labels: [batch, num_classes] with 0/1

Production Gotchas
If you fine-tune with one tokenizer but use a different one at inference (or vice versa), the token IDs won't align and prediction quality collapses. Always load the tokenizer from the same checkpoint as the model: `AutoTokenizer.from_pretrained(model_name)`, where `model_name` is the same name or path you pass to the model's `from_pretrained`.
If 95% of the data is 'negative' and 5% is 'positive', a naive classifier that predicts 'negative' for everything reaches 95% accuracy while missing the minority class entirely. Use stratified train/test splits, weighted loss functions (`CrossEntropyLoss(weight=...)`), or SMOTE for synthetic oversampling of the vectorized features, and check per-class precision and recall rather than accuracy alone.
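One common way to fill in the `weight=...` argument is inverse class frequency. The sketch below uses the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# The 95% / 5% split from the text above.
labels = ["negative"] * 95 + ["positive"] * 5
weights = balanced_class_weights(labels)

# The rare class gets a much larger weight than the common one.
for cls, w in weights.items():
    print(f"{cls}: {w:.3f}")
```

To use these with `CrossEntropyLoss`, the values would need to be placed in a tensor ordered by class index, which depends on how your labels are encoded.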
`model.forward()` returns raw logits, not normalized probabilities. Apply softmax (single-label) or sigmoid (multi-label) to map them into the 0-1 range before thresholding or reporting confidence scores.
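To make the difference concrete, here is a dependency-free sketch of both conversions applied to the same hypothetical logits. In practice you would call `torch.softmax` or `torch.sigmoid` on the model output; the pure-Python versions below only illustrate the arithmetic.

```python
import math

def softmax(logits):
    """Single-label: probabilities over classes that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Multi-label: independent probability for one class."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for large negative x
    return e / (1.0 + e)

logits = [2.0, 0.5, -1.0]  # hypothetical raw model outputs
single_label = softmax(logits)
multi_label = [sigmoid(x) for x in logits]

print([round(p, 3) for p in single_label])
print([round(p, 3) for p in multi_label])
```

Softmax scores compete with each other, so they suit exactly-one-class tasks. Sigmoid scores are independent, so several classes can exceed a threshold at once.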
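Classifying 100K documents with GPT-4o can cost roughly $1000, depending on document length and current pricing. Use a cheaper model (gpt-4o-mini: roughly 90% accuracy at about a tenth of the cost; verify on your own data), use the batch API (50% discount in exchange for up to 24 hours of latency), or fine-tune once and deploy locally.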
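A back-of-the-envelope estimate is easy to script before committing to an API. The sketch below is only arithmetic: the tokens-per-document and price-per-million-tokens values are hypothetical placeholders, not current prices, and `estimate_cost` is an illustrative helper.

```python
def estimate_cost(n_docs, tokens_per_doc, usd_per_million_tokens, batch_discount=0.0):
    """Rough API cost in USD; ignores output tokens and retries."""
    total_tokens = n_docs * tokens_per_doc
    cost = total_tokens / 1_000_000 * usd_per_million_tokens
    return cost * (1.0 - batch_discount)

# Hypothetical inputs: 2,000 tokens per document at $5 per million tokens.
full_price = estimate_cost(100_000, 2_000, 5.0)
with_batch = estimate_cost(100_000, 2_000, 5.0, batch_discount=0.5)

print(full_price)  # 1000.0
print(with_batch)  # 500.0
```

Swap in your own measured token counts and the provider's published prices to get a number you can act on.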
The default 0.5 threshold is rarely optimal. Use precision-recall curves (`sklearn.metrics.precision_recall_curve`) on a validation set to find the threshold that maximizes your target metric (F1, precision, or recall). Multi-label tasks often need per-class thresholds.
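The search itself is simple enough to sketch without any library. The example below scans every observed score as a candidate threshold and keeps the one with the best F1. The data is made up for illustration, and `f1_at_threshold` and `best_threshold` are hypothetical helper names.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores):
    """Try each distinct score as a threshold and return the best one by F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))

# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.80]

th = best_threshold(y_true, scores)
print(th, round(f1_at_threshold(y_true, scores, th), 3))
print(0.5, round(f1_at_threshold(y_true, scores, 0.5), 3))
```

For a multi-label model, the same search would be run once per class on that class's column of scores. Tune on validation data, not the test set, so the reported metric stays honest.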
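If the training set is 80% 'class_a' and the test set is 40% 'class_a', test accuracy says little about how the model behaves on data distributed like the training set. A stratified split keeps class proportions the same in both sets: `train_test_split(X, y, test_size=0.2, stratify=y)`.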
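What `stratify=y` does can be sketched in plain Python: split each class separately, then combine. This is an illustration of the idea under simple assumptions, not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # same fraction from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = ["class_a"] * 80 + ["class_b"] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts["class_a"] / len(train_idx))  # 0.8
print(test_counts["class_a"] / len(test_idx))    # 0.8
```

Checking the class proportions of both splits like this is a cheap sanity test to run after any split, stratified or not.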