How to use scikit-learn for text classification
Quick answer
Use scikit-learn to build text classification models by converting text into numerical features with TfidfVectorizer and training a classifier such as LogisticRegression. The pipeline typically involves preprocessing the text, vectorizing it, training, and evaluating the model.

Prerequisites

- Python 3.8+
- pip install scikit-learn>=1.2
- Basic knowledge of Python and machine learning
Setup
Install scikit-learn and numpy if not already installed. These libraries provide tools for feature extraction and classification.
pip install scikit-learn numpy

output

Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp38-cp38-manylinux1_x86_64.whl (7.1 MB)
Collecting numpy
  Downloading numpy-1.25.0-cp38-cp38-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: numpy, scikit-learn
Successfully installed numpy-1.25.0 scikit-learn-1.2.2
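To confirm the installation succeeded, a quick version check can be run (a minimal sketch):

```python
# Verify that both libraries import correctly and report their versions
import numpy
import sklearn

print("scikit-learn:", sklearn.__version__)
print("numpy:", numpy.__version__)
```

If either import fails, the packages are not visible to the Python interpreter you are running.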
Step by step
This example shows how to classify text documents into categories using TfidfVectorizer for feature extraction and LogisticRegression as the classifier. It includes training, prediction, and evaluation.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
# Load dataset
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))
# Create a pipeline: vectorizer + classifier
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.7),
    LogisticRegression(max_iter=1000)
)
# Train the model
model.fit(data_train.data, data_train.target)
# Predict on test data
predicted = model.predict(data_test.data)
# Evaluate
print(classification_report(data_test.target, predicted, target_names=categories))

output
precision recall f1-score support
alt.atheism 0.85 0.82 0.83 319
comp.graphics 0.87 0.91 0.89 389
sci.med 0.91 0.91 0.91 396
soc.religion.christian 0.88 0.89 0.88 398
accuracy 0.88 1502
macro avg 0.88 0.88 0.88 1502
weighted avg 0.88 0.88 0.88 1502

Common variations
- Use other classifiers such as RandomForestClassifier or MultinomialNB for different performance characteristics.
- Apply CountVectorizer instead of TfidfVectorizer for simple term-frequency features.
- Use GridSearchCV to tune hyperparameters such as max_df or the regularization strength.
- For large datasets, use incremental learning with partial_fit on classifiers like SGDClassifier.
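The GridSearchCV variation can be sketched as follows. The tiny corpus and grid values here are illustrative placeholders; in practice you would pass data_train.data and data_train.target from the example above. Note that make_pipeline names each step after its lowercased class name, so grid keys use the step__parameter convention.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (stand-in for data_train.data / data_train.target)
docs = [
    "opengl rendering shaders texture polygon",
    "image pixel rendering graphics card",
    "gpu shader texture mapping render",
    "3d mesh polygon vertex graphics",
    "doctor patient treatment diagnosis symptom",
    "medicine dosage patient clinical trial",
    "disease symptom diagnosis therapy",
    "hospital patient surgery recovery",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000),
)

# Grid keys follow '<step>__<param>'; the values here are arbitrary examples
param_grid = {
    'tfidfvectorizer__max_df': [0.5, 0.7, 0.9],
    'logisticregression__C': [0.1, 1.0, 10.0],  # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

By default, GridSearchCV refits the best parameter combination on the full training data, so search.best_estimator_ can be used directly for prediction afterwards.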
For example, swapping in MultinomialNB:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model_nb = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    MultinomialNB()
)
model_nb.fit(data_train.data, data_train.target)
predicted_nb = model_nb.predict(data_test.data)
print(classification_report(data_test.target, predicted_nb, target_names=categories))

output
precision recall f1-score support
alt.atheism 0.83 0.81 0.82 319
comp.graphics 0.85 0.89 0.87 389
sci.med 0.90 0.90 0.90 396
soc.religion.christian 0.87 0.87 0.87 398
accuracy 0.87 1502
macro avg 0.86 0.87 0.87 1502
weighted avg 0.87 0.87 0.87 1502

Troubleshooting
- If you get an ImportError, ensure scikit-learn and numpy are installed and compatible with your Python version.
- For slow training, reduce max_df or limit the vocabulary size in TfidfVectorizer (for example with max_features).
- If the model overfits, adjust the regularization parameters or use cross-validation.
- If accuracy is poor, make sure the text data is cleaned, removing noise such as HTML tags or special characters.
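A basic cleaning pass of the kind described above can be sketched in plain Python before vectorizing; the regexes here are illustrative, not exhaustive:

```python
import re

def clean_text(text):
    """Strip HTML tags, URLs, and special characters from a document."""
    text = re.sub(r'<[^>]+>', ' ', text)          # drop HTML tags
    text = re.sub(r'https?://\S+', ' ', text)     # drop URLs
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)   # drop special characters
    return re.sub(r'\s+', ' ', text).strip()      # collapse whitespace

print(clean_text("<p>Visit https://example.com for more #info!</p>"))
# → Visit for more info
```

Applied to the earlier example, you would fit on a cleaned list such as [clean_text(d) for d in data_train.data].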
Key Takeaways
- Use TfidfVectorizer to convert text into numerical features for classification.
- Combine vectorization and classification in a scikit-learn pipeline for clean code and easy experimentation.
- Try different classifiers such as LogisticRegression or MultinomialNB depending on your dataset and accuracy needs.
- Tune vectorizer and classifier hyperparameters with tools like GridSearchCV to improve performance.
- Clean and preprocess text data to enhance model accuracy and reduce noise.