How to handle missing values in Scikit-learn
Use SimpleImputer from sklearn.impute to fill missing values with strategies like mean, median, or most frequent. Integrate it into a Pipeline so that missing values are handled automatically and consistently before model fitting.
Why this happens
Missing values in datasets cause errors in Scikit-learn models because most estimators do not accept NaN values. For example, on Scikit-learn releases before 1.4 (random forests gained native NaN support in 1.4), fitting a RandomForestClassifier on data with NaN entries raises a ValueError:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This happens because such estimators expect complete numeric input for both training and prediction. The snippet below reproduces the error:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X = np.array([[1, 2], [3, np.nan], [7, 6]])
y = [0, 1, 0]
model = RandomForestClassifier()
model.fit(X, y)  # Raises ValueError due to NaN
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
The fix
Use SimpleImputer to replace missing values with a statistic such as the column mean. Wrapping it in a Pipeline means the imputer is fitted on the training data and the same fill values are reused at prediction time, so the model never receives NaN input.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
X = np.array([[1, 2], [3, np.nan], [7, 6]])
y = [0, 1, 0]
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('classifier', RandomForestClassifier())
])
pipeline.fit(X, y)
print("Model trained successfully with imputation.")
Model trained successfully with imputation.
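The choice of strategy changes the value that gets filled in. As a rough illustration, the sketch below fits SimpleImputer with each of the three strategies mentioned earlier on a small toy array (the data and the fill_values dictionary are made up for this example) and records the value used for the missing entry in the second column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (illustrative only): one missing entry in the second column.
X = np.array([[1.0, 2.0], [3.0, np.nan], [7.0, 6.0], [4.0, 6.0]])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    X_filled = imputer.fit_transform(X)
    # statistics_ holds the per-column fill value learned during fit.
    fill_values[strategy] = X_filled[1, 1]
    print(strategy, imputer.statistics_, X_filled[1, 1])
```

With the observed values 2, 6 and 6 in that column, the mean strategy fills in their average, while median and most_frequent both fill in 6. Median is usually the safer choice when a column has outliers, and most_frequent also works for categorical columns.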
Preventing it in production
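Always validate input data for missing values before inference. Reuse the fitted pipeline at inference time so the imputer applies the statistics learned from the training data. Consider fallback strategies such as default values, or raise an alert when the share of missing data exceeds a threshold you have chosen. This keeps predictions from failing at runtime and makes data quality problems visible early.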
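One way to implement such a check is sketched below. The helper names (missing_fraction_per_column, validate_missing) and the 0.5 threshold are hypothetical and chosen only for illustration. The idea is to measure the fraction of NaN entries per column and reject a batch before it reaches the pipeline if any column is above the limit.

```python
import numpy as np


def missing_fraction_per_column(X):
    """Return the fraction of NaN entries in each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    return np.isnan(X).mean(axis=0)


def validate_missing(X, max_fraction=0.5):
    """Raise ValueError if any column exceeds the allowed NaN fraction.

    max_fraction is a hypothetical threshold; tune it per dataset.
    """
    fractions = missing_fraction_per_column(X)
    bad = np.flatnonzero(fractions > max_fraction)
    if bad.size:
        raise ValueError(
            f"Columns {bad.tolist()} exceed the missing-value threshold "
            f"{max_fraction}: {fractions[bad].tolist()}"
        )
    return fractions


# Illustrative inference batch: the second column is mostly missing.
batch = np.array([[1.0, np.nan], [2.0, np.nan], [np.nan, 4.0]])
try:
    validate_missing(batch, max_fraction=0.5)
    status = "ok"
except ValueError as exc:
    status = "rejected"
    print(exc)
```

In a real service the except branch would typically log the event or trigger an alert instead of printing, and batches that pass the check would go on to pipeline.predict.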
Key Takeaways
- Use SimpleImputer to handle missing values before model training.
- Integrate imputation into a Pipeline for consistent preprocessing.
- Validate and monitor input data in production to avoid runtime errors.