How-to · Beginner · 4 min read

How to use scikit-learn for text classification

Quick answer
Use scikit-learn to build text classification models by converting text data into numerical features with TfidfVectorizer and training classifiers like LogisticRegression. The pipeline typically involves preprocessing text, vectorizing, training, and evaluating the model.
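As a minimal sketch of that pipeline (the tiny corpus and its spam/not-spam labels below are made up purely for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; labels are hypothetical (1 = spam, 0 = not spam)
texts = ["win free money now", "meeting moved to noon",
         "claim your free prize", "lunch tomorrow?"]
labels = [1, 0, 1, 0]

# Vectorize and classify in a single pipeline object
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["free money prize"]))  # integer class label
```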

PREREQUISITES

  • Python 3.8+
  • pip install "scikit-learn>=1.2" (quoted so the shell does not interpret >=)
  • Basic knowledge of Python and machine learning

Setup

Install scikit-learn and numpy if not already installed. These libraries provide tools for feature extraction and classification.

bash
pip install scikit-learn numpy
output
Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp38-cp38-manylinux1_x86_64.whl (7.1 MB)
Collecting numpy
  Downloading numpy-1.25.0-cp38-cp38-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: numpy, scikit-learn
Successfully installed numpy-1.25.0 scikit-learn-1.2.2

Step by step

This example shows how to classify text documents into categories using TfidfVectorizer for feature extraction and LogisticRegression as the classifier. It includes training, prediction, and evaluation.

python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Load dataset
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create a pipeline: vectorizer + classifier
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.7),
    LogisticRegression(max_iter=1000)
)

# Train the model
model.fit(data_train.data, data_train.target)

# Predict on test data
predicted = model.predict(data_test.data)

# Evaluate
print(classification_report(data_test.target, predicted, target_names=categories))
output
                        precision    recall  f1-score   support

           alt.atheism       0.85      0.82      0.83       319
         comp.graphics       0.87      0.91      0.89       389
               sci.med       0.91      0.91      0.91       396
soc.religion.christian       0.88      0.89      0.88       398

              accuracy                           0.88      1502
             macro avg       0.88      0.88      0.88      1502
          weighted avg       0.88      0.88      0.88      1502
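predict() returns integer class indices; map them back to readable names with the dataset's target_names (or your own label list). A self-contained sketch of that mapping, using a hypothetical two-category corpus in place of the fetched data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for data_train.data and data_train.target_names
target_names = ['comp.graphics', 'sci.med']
texts = ["opengl shader and 3d mesh rendering", "render the scene with textures",
         "flu vaccine clinical study", "patient blood pressure treatment"]
labels = [0, 0, 1, 1]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Each prediction is an index into target_names
for idx in model.predict(["vaccine treatment study"]):
    print(target_names[idx])
```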

Common variations

  • Use other classifiers like RandomForestClassifier or MultinomialNB for different performance characteristics.
  • Apply CountVectorizer instead of TfidfVectorizer for simple term frequency features.
  • Use GridSearchCV to tune hyperparameters such as max_df or regularization strength.
  • For large datasets, use incremental learning with partial_fit on classifiers like SGDClassifier.
python
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model_nb = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    MultinomialNB()
)
model_nb.fit(data_train.data, data_train.target)
predicted_nb = model_nb.predict(data_test.data)
print(classification_report(data_test.target, predicted_nb, target_names=categories))
output
                        precision    recall  f1-score   support

           alt.atheism       0.83      0.81      0.82       319
         comp.graphics       0.85      0.89      0.87       389
               sci.med       0.90      0.90      0.90       396
soc.religion.christian       0.87      0.87      0.87       398

              accuracy                           0.87      1502
             macro avg       0.86      0.87      0.87      1502
          weighted avg       0.87      0.87      0.87      1502

Troubleshooting

  • If you get ImportError, ensure scikit-learn and numpy are installed and compatible with your Python version.
  • For slow training, limit the vocabulary with max_features, or prune rare and overly common terms via min_df and max_df in TfidfVectorizer.
  • If the model overfits, increase regularization (e.g. lower C in LogisticRegression) and confirm with cross-validation.
  • If accuracy is poor, clean the text first: strip noise such as HTML tags and special characters.
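The hyperparameter tuning mentioned above can be sketched with GridSearchCV over the pipeline's step-prefixed parameter names; the toy corpus below is hypothetical, and real tuning needs far more data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, three documents per class
texts = ["opengl shader tutorial", "render the 3d scene",
         "gpu texture mapping",
         "flu vaccine study", "doctor recommends rest",
         "patient blood pressure"]
labels = [0, 0, 0, 1, 1, 1]

pipe = make_pipeline(TfidfVectorizer(stop_words='english'),
                     LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step_name>__<param>
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'logisticregression__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

The same pattern works for max_df or any other vectorizer parameter; swap the keys in param_grid accordingly.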

Key Takeaways

  • Use TfidfVectorizer to convert text into numerical features for classification.
  • Combine vectorization and classification in a scikit-learn pipeline for clean code and easy experimentation.
  • Try different classifiers like LogisticRegression or MultinomialNB based on your dataset and accuracy needs.
  • Tune vectorizer and classifier hyperparameters with tools like GridSearchCV to improve performance.
  • Clean and preprocess text data to enhance model accuracy and reduce noise.
Verified 2026-04 · LogisticRegression, MultinomialNB, TfidfVectorizer