How to use cross validation in Scikit-learn
Quick answer
Use cross_val_score or cross_validate from sklearn.model_selection to perform cross validation. Both functions split your dataset into training and validation folds automatically, evaluate the model on each fold, and return per-fold scores, which gives a more reliable performance estimate than a single train/test split.
Prerequisites
Python 3.8+
pip install "scikit-learn>=1.2"
Setup
Install Scikit-learn if you haven't already. This example uses Python 3.8+ and Scikit-learn 1.2 or newer.
pip install "scikit-learn>=1.2"
Step by step
This example shows how to use cross_val_score to evaluate a logistic regression model on the Iris dataset with 5-fold cross validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Initialize model
model = LogisticRegression(max_iter=200)
# Perform 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy scores: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
Output:
Cross-validation accuracy scores: [1. 0.97 0.97 0.97 1. ]
Mean accuracy: 0.982
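To make it clearer what cross_val_score does internally, here is a minimal sketch of the same evaluation written as an explicit loop. It assumes that for a classifier and an integer cv, scikit-learn uses an unshuffled StratifiedKFold, which is the documented default. The names manual_scores and library_scores are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Assumption: cv=5 with a classifier corresponds to StratifiedKFold(n_splits=5), no shuffling
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    # For classifiers, score() returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

# Both approaches should produce the same per-fold accuracies
print(np.allclose(manual_scores, library_scores))
```

Each fold trains on four fifths of the data and scores on the remaining fifth, so every sample is used for validation exactly once.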
Common variations
- Use cross_validate to get multiple metrics and fit times.
- Change cv to another splitter, such as StratifiedKFold for classification.
- Use a different scoring metric, such as roc_auc (for binary targets; roc_auc_ovr for multiclass data like Iris) or f1_macro.
from sklearn.model_selection import cross_validate, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ['accuracy', 'f1_macro']
results = cross_validate(model, X, y, cv=cv, scoring=scoring, return_train_score=False)
print(f"Accuracy scores: {results['test_accuracy']}")
print(f"F1 macro scores: {results['test_f1_macro']}")
Output:
Accuracy scores: [1. 0.96666667 0.93333333 0.96666667 1. ]
F1 macro scores: [1. 0.96658312 0.93069307 0.96658312 1. ]
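The dictionary returned by cross_validate also carries timing information, and the scoring list accepts ROC AUC variants. The sketch below is one possible way to read both. It uses roc_auc_ovr (one-vs-rest) on the assumption that plain roc_auc is meant for binary targets, while Iris has three classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 'roc_auc_ovr' is the one-vs-rest ROC AUC scorer, suitable for multiclass targets
results = cross_validate(model, X, y, cv=cv, scoring=['accuracy', 'roc_auc_ovr'])

# Keys include fit_time, score_time, and one test_<metric> entry per scorer
print(sorted(results.keys()))
print(f"Mean fit time (s): {results['fit_time'].mean():.4f}")
print(f"Mean ROC AUC (OvR): {results['test_roc_auc_ovr'].mean():.3f}")
```

Fit times depend on the machine, so treat them as relative indicators when comparing models rather than as fixed numbers.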
Troubleshooting
- If you get convergence warnings with logistic regression, increase max_iter.
- If scores vary widely between folds, try stratified splits or increase the number of folds passed to cv.
- If your rows are stored in a meaningful order (for example, sorted by class), set shuffle=True in the splitter so each fold is representative.
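The shuffling point is easy to see on Iris, whose rows are sorted by class (50 samples of each). The sketch below compares a plain KFold with and without shuffling. The choice of 3 folds and random_state=0 is an assumption made only to keep the illustration simple.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)  # rows are ordered by class: 50 of each
model = LogisticRegression(max_iter=200)

# Without shuffling, each of the 3 folds holds out one entire class,
# so the model never sees the class it is tested on
unshuffled = cross_val_score(model, X, y, cv=KFold(n_splits=3))

# With shuffling, every fold contains a mix of all three classes
shuffled = cross_val_score(
    model, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0)
)

print(f"Unshuffled KFold mean accuracy: {unshuffled.mean():.3f}")
print(f"Shuffled KFold mean accuracy:   {shuffled.mean():.3f}")
```

Passing an integer such as cv=5 with a classifier avoids this problem because scikit-learn then stratifies the folds by class, but an explicit KFold does not.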
Key Takeaways
- Use cross_val_score for quick model evaluation with cross validation.
- Use cross_validate when you need multiple scoring metrics or fit times, and pass a splitter such as StratifiedKFold to cv to control how folds are built.
- Always check model convergence and data shuffling to ensure reliable results.