How to use scikit-learn for text classification
Quick answer
Use scikit-learn to build text classification models by converting text into numerical features with TfidfVectorizer and training a classifier such as LogisticRegression. The pipeline typically involves preprocessing the text, vectorizing it, training, and evaluating the model.

Prerequisites

- Python 3.8+
- pip install scikit-learn>=1.2
- Basic knowledge of Python and machine learning
Setup
Install scikit-learn and numpy if not already installed. These libraries provide tools for feature extraction and classification.
pip install scikit-learn numpy

output

Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp38-cp38-manylinux1_x86_64.whl (7.1 MB)
Collecting numpy
  Downloading numpy-1.25.0-cp38-cp38-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: numpy, scikit-learn
Successfully installed numpy-1.25.0 scikit-learn-1.2.2
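To confirm the installation succeeded, a quick version check can be run (a minimal sketch):

```python
# Verify that both libraries import correctly and report their versions
import numpy
import sklearn

print("scikit-learn:", sklearn.__version__)
print("numpy:", numpy.__version__)
```

If either import fails, the packages are not visible to the Python interpreter you are running.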
Step by step
This example shows how to classify text documents into categories using TfidfVectorizer for feature extraction and LogisticRegression as the classifier. It includes training, prediction, and evaluation.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
# Load dataset
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
data_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))
# Create a pipeline: vectorizer + classifier
model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_df=0.7),
    LogisticRegression(max_iter=1000)
)
# Train the model
model.fit(data_train.data, data_train.target)
# Predict on test data
predicted = model.predict(data_test.data)
# Evaluate
print(classification_report(data_test.target, predicted, target_names=categories))

output
precision recall f1-score support
alt.atheism 0.85 0.82 0.83 319
comp.graphics 0.87 0.91 0.89 389
sci.med 0.91 0.91 0.91 396
soc.religion.christian 0.88 0.89 0.88 398
accuracy 0.88 1502
macro avg 0.88 0.88 0.88 1502
weighted avg 0.88 0.88 0.88 1502

Common variations
- Use other classifiers such as RandomForestClassifier or MultinomialNB for different performance characteristics.
- Apply CountVectorizer instead of TfidfVectorizer for simple term-frequency features.
- Use GridSearchCV to tune hyperparameters such as max_df or the regularization strength.
- For large datasets, use incremental learning with partial_fit on classifiers like SGDClassifier.
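The GridSearchCV variation can be sketched as follows. The tiny corpus and grid values here are illustrative placeholders; in practice you would pass data_train.data and data_train.target from the example above. Note that make_pipeline names each step after its lowercased class name, so grid keys use the step__parameter convention.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (stand-in for data_train.data / data_train.target)
docs = [
    "opengl rendering shaders texture polygon",
    "image pixel rendering graphics card",
    "gpu shader texture mapping render",
    "3d mesh polygon vertex graphics",
    "doctor patient treatment diagnosis symptom",
    "medicine dosage patient clinical trial",
    "disease symptom diagnosis therapy",
    "hospital patient surgery recovery",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    LogisticRegression(max_iter=1000),
)

# Grid keys follow '<step>__<param>'; the values here are arbitrary examples
param_grid = {
    'tfidfvectorizer__max_df': [0.5, 0.7, 0.9],
    'logisticregression__C': [0.1, 1.0, 10.0],  # inverse regularization strength
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(docs, labels)
print(search.best_params_)
```

By default, GridSearchCV refits the best parameter combination on the full training data, so search.best_estimator_ can be used directly for prediction afterwards.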
For example, swapping in MultinomialNB:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
model_nb = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    MultinomialNB()
)
model_nb.fit(data_train.data, data_train.target)
predicted_nb = model_nb.predict(data_test.data)
print(classification_report(data_test.target, predicted_nb, target_names=categories))

output
precision recall f1-score support
alt.atheism 0.83 0.81 0.82 319
comp.graphics 0.85 0.89 0.87 389
sci.med 0.90 0.90 0.90 396
soc.religion.christian 0.87 0.87 0.87 398
accuracy 0.87 1502
macro avg 0.86 0.87 0.87 1502
weighted avg 0.87 0.87 0.87 1502

Troubleshooting
- If you get an ImportError, ensure scikit-learn and numpy are installed and compatible with your Python version.
- For slow training, reduce max_df or limit the vocabulary size in TfidfVectorizer (for example with max_features).
- If the model overfits, adjust the regularization parameters or use cross-validation.
- If accuracy is poor, make sure the text data is cleaned, removing noise such as HTML tags or special characters.
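A basic cleaning pass of the kind described above can be sketched in plain Python before vectorizing; the regexes here are illustrative, not exhaustive:

```python
import re

def clean_text(text):
    """Strip HTML tags, URLs, and special characters from a document."""
    text = re.sub(r'<[^>]+>', ' ', text)          # drop HTML tags
    text = re.sub(r'https?://\S+', ' ', text)     # drop URLs
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)   # drop special characters
    return re.sub(r'\s+', ' ', text).strip()      # collapse whitespace

print(clean_text("<p>Visit https://example.com for more #info!</p>"))
# → Visit for more info
```

Applied to the earlier example, you would fit on a cleaned list such as [clean_text(d) for d in data_train.data].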
Key Takeaways
- Use TfidfVectorizer to convert text into numerical features for classification.
- Combine vectorization and classification in a scikit-learn pipeline for clean code and easy experimentation.
- Try different classifiers such as LogisticRegression or MultinomialNB depending on your dataset and accuracy needs.
- Tune vectorizer and classifier hyperparameters with tools like GridSearchCV to improve performance.
- Clean and preprocess text data to enhance model accuracy and reduce noise.