How-to · Beginner · 4 min read

How to use a scikit-learn Pipeline

Quick answer
Use the Pipeline class from sklearn.pipeline to chain preprocessing and estimator steps into a single object. Calling fit and predict on the pipeline runs every step in order, so the same transformations are applied consistently at training and prediction time.

PREREQUISITES

  • Python 3.8+
  • pip install "scikit-learn>=1.2"

Setup

Install Scikit-learn if not already installed. Import necessary modules for building a pipeline including preprocessing and model classes.

bash
pip install "scikit-learn>=1.2"

Step by step

This example demonstrates creating a Pipeline that standardizes features with StandardScaler and fits a logistic regression model. The pipeline is then used to fit and predict on sample data.

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define pipeline steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(random_state=42))
])

# Train pipeline
pipeline.fit(X_train, y_train)

# Predict
predictions = pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
output
Test accuracy: 1.00

Common variations

You can customize pipelines by adding different preprocessing steps like PolynomialFeatures, or wrap a pipeline in a grid search for hyperparameter tuning (parameters are addressed as step__param). Pipelines also support FeatureUnion to combine multiple feature extraction methods side by side.

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('ridge', Ridge())
])

param_grid = {
    'poly__degree': [2, 3],
    'ridge__alpha': [0.1, 1.0, 10.0]
}

grid = GridSearchCV(pipeline, param_grid, cv=5)

# Example data
import numpy as np
X = np.arange(10).reshape(-1, 1)
y = np.sin(X).ravel()

grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
output
Best params: {'poly__degree': 3, 'ridge__alpha': 0.1}
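The variations above also mention FeatureUnion. A minimal sketch of that idea, using only standard scikit-learn classes (PCA and PolynomialFeatures are just illustrative choices for the two branches): the union concatenates each branch's output columns, and the combined matrix then flows through the rest of the pipeline.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Two feature-extraction branches whose outputs are concatenated:
# 2 PCA components + 14 polynomial features = 16 columns total
union = FeatureUnion([
    ('pca', PCA(n_components=2)),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])

pipeline = Pipeline([
    ('features', union),
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

pipeline.fit(X, y)
print(f"Training accuracy: {pipeline.score(X, y):.2f}")
```

As with plain pipeline steps, each branch can be tuned through a grid search using nested names such as features__pca__n_components.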

Troubleshooting

  • If you get a ValueError about incompatible shapes, ensure your input data is a 2D array (e.g., reshape 1D arrays with .reshape(-1, 1)).
  • If pipeline steps fail, verify each step implements fit and transform or predict methods as required.
  • Use pipeline.named_steps to access individual steps for debugging.
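To illustrate the named_steps tip above, a small sketch that fits the same scaler-plus-logistic-regression pipeline from the walkthrough and then pulls out the fitted steps to inspect what each one learned:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Fitted steps are reachable by the names given at construction time
scaler = pipeline.named_steps['scaler']
print(scaler.mean_)  # per-feature means learned during fit

logreg = pipeline.named_steps['logreg']
print(logreg.coef_.shape)  # (3, 4): one coefficient row per iris class
```

This is handy for debugging shape errors: checking what a transformer actually learned often reveals which step received unexpected input.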

Key Takeaways

  • Use Pipeline to chain preprocessing and modeling steps for cleaner code and consistent transformations.
  • Pipelines integrate seamlessly with Scikit-learn tools like GridSearchCV for hyperparameter tuning.
  • Always ensure input data shapes and step compatibility to avoid common errors.
  • Access pipeline steps via named_steps for inspection or modification.
  • Pipelines improve reproducibility and reduce data leakage risks in ML workflows.
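The data-leakage point in the takeaways can be made concrete with a short sketch: passing the whole pipeline to cross_val_score means the scaler is refit on each training fold, so statistics from the held-out fold never leak into the transformation.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# The scaler is refit inside every training fold; scaling X once
# up front and cross-validating only the model would leak test-fold
# statistics into the preprocessing.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f}")
```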
Verified 2026-04