How to handle missing values in Scikit-learn

Q: How to handle missing values in Scikit-learn

Quick answer

Use SimpleImputer from sklearn.impute to fill missing values with strategies like mean, median, or most frequent. Integrate it into a Pipeline to ensure consistent preprocessing before model training.

⚡ QUICK FIX

Add SimpleImputer to your preprocessing pipeline to automatically handle missing values before model fitting.

Why this happens

Missing values cause errors in Scikit-learn because most estimators do not accept NaN. For example, fitting a LogisticRegression on data with NaN entries raises a ValueError:

ValueError: Input X contains NaN.

The full message continues with a suggestion to impute or to use an estimator with native NaN support. Older releases word it as "Input contains NaN, infinity or a value too large for dtype(...)".

A few estimators do handle NaN natively, including HistGradientBoostingClassifier and, since scikit-learn 1.4, RandomForestClassifier. The example below uses LogisticRegression so the error reproduces regardless of version.

python

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 2], [3, np.nan], [7, 6]])
y = [0, 1, 0]

model = LogisticRegression()
model.fit(X, y)  # Raises ValueError due to NaN

output

ValueError: Input X contains NaN.
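Before fitting, it can help to locate the missing entries. The following is a minimal sketch using plain NumPy on the same toy array; the variable names are illustrative and not part of any Scikit-learn API.

```python
import numpy as np

X = np.array([[1, 2], [3, np.nan], [7, 6]])

# Boolean mask that is True wherever a value is NaN
nan_mask = np.isnan(X)

# Number of missing entries in each column
nan_per_column = nan_mask.sum(axis=0)

# Indices of rows that contain at least one NaN
rows_with_nan = np.flatnonzero(nan_mask.any(axis=1))

print("NaN count per column:", nan_per_column)
print("Rows containing NaN:", rows_with_nan)
```

If the counts are all zero, the error comes from somewhere else, such as infinite values.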

The fix

Use SimpleImputer to replace missing values with a statistic such as the column mean. Wrapping it in a Pipeline means the fill values are learned from the training data during fit and reused unchanged at prediction time, so training and inference data are preprocessed identically.

python

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = np.array([[1, 2], [3, np.nan], [7, 6]])
y = [0, 1, 0]

# Impute first, then classify; the imputer learns the column means during fit
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('classifier', LogisticRegression())
])

pipeline.fit(X, y)
print("Model trained successfully with imputation.")

output

Model trained successfully with imputation.
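To see what the imputer actually learns, the sketch below fits SimpleImputer on its own and inspects its statistics_ attribute, then compares the 'median' and 'constant' strategies. The arrays X_train and X_new are made-up illustration data, and the fill value of 0.0 is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [3.0, np.nan], [7.0, 6.0]])
X_new = np.array([[np.nan, np.nan]])  # unseen row with both values missing

# 'median' learns one fill value per column, ignoring NaN entries
median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(X_train)
print("Learned medians:", median_imputer.statistics_)

# New data is filled with the values learned from the training data
filled_median = median_imputer.transform(X_new)
print("Median-filled row:", filled_median)

# 'constant' ignores the data and uses fill_value instead
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled_constant = constant_imputer.fit(X_train).transform(X_new)
print("Constant-filled row:", filled_constant)
```

Median is often preferred over mean when a column has outliers, while 'most_frequent' and 'constant' also work for categorical columns.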

Preventing it in production

Keep the imputer inside the fitted pipeline so that inference data receives the same fill values as the training data. Before inference, measure the share of missing values per feature. If it exceeds a threshold you choose, raise an alert or reject the batch instead of silently imputing, because a sudden rise in missing data usually points to an upstream problem.
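One way to implement such a check is sketched below. The helper check_missing_fraction and its 50% default threshold are hypothetical, chosen only for illustration; a real system would pick thresholds per feature and route the failure to its own alerting.

```python
import numpy as np

def check_missing_fraction(X, max_fraction=0.5):
    """Return the NaN fraction per column; raise if any exceeds max_fraction.

    Hypothetical helper: the name and default threshold are assumptions.
    """
    X = np.asarray(X, dtype=float)
    fractions = np.isnan(X).mean(axis=0)
    offending = np.flatnonzero(fractions > max_fraction)
    if offending.size > 0:
        raise ValueError(
            f"Columns {offending.tolist()} exceed the missing-value "
            f"threshold of {max_fraction:.0%}"
        )
    return fractions

# One of four values missing in column 1: below the threshold
ok_batch = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
# Three of four values missing in column 1: above the threshold
bad_batch = np.array([[1.0, np.nan], [3.0, np.nan], [5.0, np.nan], [7.0, 8.0]])

print("Missing fractions:", check_missing_fraction(ok_batch))

try:
    check_missing_fraction(bad_batch)
except ValueError as err:
    print("Rejected batch:", err)
```

Running the check before pipeline.predict keeps the decision to impute or reject explicit.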

Related errors

Error	Cause	Quick fix
ValueError: Input contains NaN	Missing values in input data	Use SimpleImputer to fill missing values
ValueError: cannot convert float NaN to integer	Integer column contains NaN	Keep the column as float, or impute before casting to int
Pipeline fails on transform	Imputer not included in pipeline	Add SimpleImputer as first pipeline step
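For the integer-column case in the table, a minimal sketch: store the column as float so NaN can be represented, impute, and cast to int only afterwards. The ages array is made-up illustration data, and 'most_frequent' is just one reasonable strategy for count-like values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Float dtype is required here: NumPy integer arrays cannot hold NaN
ages = np.array([[25.0], [np.nan], [40.0], [25.0]])

# Fill the gap with the most common observed value in the column
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(ages)

# Casting is safe now because no NaN remains
ages_int = filled.astype(int).ravel()
print(ages_int)
```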

Key Takeaways

Use SimpleImputer to handle missing values before model training.
Integrate imputation into a Pipeline for consistent preprocessing.
Validate and monitor input data in production to avoid runtime errors.

Verified 2026-04
