How-to · Beginner · 3 min read

How to evaluate a classification model

Quick answer
To evaluate a classification model, use key metrics such as accuracy, precision, recall, and F1 score. In Python, scikit-learn provides functions such as classification_report and confusion_matrix to compute these metrics efficiently.
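As a minimal sketch with hand-picked toy labels (illustrative only, not real model output), the four headline metrics can be computed directly:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]  # ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]  # model predictions (one false negative)

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.83
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1.00
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.75
print(f"F1 score:  {f1_score(y_true, y_pred):.2f}")         # 0.86
```

Precision is perfect here because every predicted positive is correct, while recall drops because one true positive was missed; F1 is their harmonic mean.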

Prerequisites

  • Python 3.8+
  • pip install scikit-learn
  • Basic knowledge of classification models

Setup

Install the scikit-learn library, which provides utilities to evaluate classification models. Ensure you have Python 3.8 or higher.

bash
pip install scikit-learn
output
Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp38-cp38-manylinux1_x86_64.whl (7.1 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.3.0

Step by step

Use scikit-learn to compute evaluation metrics for your classification model. Below is a complete example using a sample dataset and a logistic regression model.

python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
output
Confusion Matrix:
[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
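The report aggregates per-class scores into macro and weighted averages. As a sketch reusing the same iris pipeline, the individual metric functions reproduce those averages when you pass an explicit averaging strategy:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_pred = LogisticRegression(max_iter=200).fit(X_train, y_train).predict(X_test)

# Multiclass metrics need an averaging strategy:
# 'macro' weights every class equally; 'weighted' weights by class support.
print(f"Accuracy:    {accuracy_score(y_test, y_pred):.2f}")
print(f"Macro F1:    {f1_score(y_test, y_pred, average='macro'):.2f}")
print(f"Weighted F1: {f1_score(y_test, y_pred, average='weighted'):.2f}")
```

Prefer the macro average when minority classes matter as much as majority ones; the weighted average tracks overall accuracy more closely.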

Common variations

For typical classification tasks, evaluating synchronously on a held-out test set is sufficient; asynchronous or streaming evaluation only matters in advanced production pipelines. For binary classification you can also use threshold-independent metrics such as ROC AUC (roc_auc_score) or average precision (average_precision_score). Different model types (e.g., decision trees, SVMs) are evaluated with the same approach.

python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# load_breast_cancer stands in here for any two-class dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Both metrics need scores, not hard labels
y_scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(f"ROC AUC: {roc_auc_score(y_test, y_scores):.3f}")
print(f"Average precision: {average_precision_score(y_test, y_scores):.3f}")

The exact scores depend on the dataset and the train/test split.

Troubleshooting

  • If you see ConvergenceWarning during model training, increase max_iter or scale your features.
  • If metrics seem low, check for data leakage or imbalanced classes and consider stratified splits or resampling.
  • Ensure your predictions and true labels have matching shapes and types to avoid errors in metric functions.
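The imbalanced-class advice above can be sketched concretely: stratify= in train_test_split keeps class proportions identical across splits, and class_weight='balanced' (supported by LogisticRegression and many other scikit-learn estimators) reweights training inversely to class frequency. Iris is balanced, so it serves only to show the API:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y preserves the class distribution in both splits,
# which stabilizes metrics computed on the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight='balanced' penalizes errors on rare classes more heavily.
model = LogisticRegression(max_iter=200, class_weight="balanced")
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```

For severe imbalance, resampling (e.g., via the imbalanced-learn package) is a common complement to class weighting.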

Key Takeaways

  • Use scikit-learn metrics like classification_report and confusion_matrix for comprehensive evaluation.
  • Evaluate multiple metrics (accuracy, precision, recall, F1) to understand model performance fully.
  • Adjust model training parameters if warnings or poor metrics occur to improve results.
Verified 2026-04